Computer Science (计算机科学) ›› 2026, Vol. 53 ›› Issue (2): 227-235. doi: 10.11896/jsjkx.241200082

• Computer Graphics & Multimedia •

多模态水声图像目标视觉检测

黄靖1,2, 王腾1, 刘健1, 胡凯1, 彭鑫1, 黄亚敏3,4, 文元桥3,4   

    1 武汉理工大学计算机与人工智能学院 武汉 430070
    2 浙江省交通运输科学研究院新一代人工智能技术应用交通运输行业研发中心 杭州 310013
    3 国家水运安全工程技术研究中心 武汉 430063
    4 武汉理工大学智能交通系统研究中心 武汉 430063
  • 收稿日期:2024-12-10 修回日期:2025-03-31 发布日期:2026-02-10
  • Corresponding author: HUANG Jing (huangjing@whut.edu.cn)
  • 基金资助:
    新一代人工智能技术应用交通运输行业研发中心开放基金(202302H);浙江省交通厅科技项目(2024006);国家自然科学基金(52072287)

Multimodal Visual Detection for Underwater Sonar Target Images

HUANG Jing1,2, WANG Teng1, LIU Jian1, HU Kai1, PENG Xin1, HUANG Yamin3,4, WEN Yuanqiao3,4   

    1 School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan 430070, China
    2 R&D Center of New-Generation Artificial Intelligence Technology Application for Transportation Industry, Zhejiang Provincial Institute of Transportation Science, Hangzhou 310013, China
    3 National Engineering Research Center for Water Transport Safety, Wuhan 430063, China
    4 Intelligent Transportation Systems Research Center, Wuhan University of Technology, Wuhan 430063, China
  • Received: 2024-12-10 Revised: 2025-03-31 Online: 2026-02-10
  • About author: HUANG Jing, born in 1977, Ph.D., associate professor, master's supervisor. His main research interests include machine learning, data mining, intelligent transportation systems, pattern recognition and computer vision.
  • Supported by:
    Transportation Industry R&D Center Open Fund for New-Gen AI Applications (202302H), Science and Technology Project of Zhejiang Provincial Department of Transportation (2024006) and National Natural Science Foundation of China (52072287).

摘要: 由于水声图像数据不足,水声目标的监督信息过少,现有的目标检测算法难以直接使用。为了解决此问题,在DETR(End-to-End Object Detection with Transformers)的基础上,提出了一种基于开集的水声图像目标检测方法USD(Underwater Sonar Detection)。首先,在跨模态特征融合编码模块中,使用多尺度可变形注意力机制对图像特征单独迭代,帮助网络有选择性地自动关注重要信息,减少计算量,同时采用多头自注意力机制迭代文本特征,提高模型对序列的全局建模能力;然后,使用双向注意力机制融合文本与图像特征,关注输入序列中的双向关系,使网络学习到更复杂的文本图像关系;最后,在图像文本特征解码模块中,使用Encoder模块输出的图像特征初始化query,在训练时使用DN(DeNoising)方法解决模型收敛慢的问题。实验表明,所提方法在自制的水声图像数据集上的平均检测精度达到77.5%,与其他检测方法相比具有更高的精度,同时实现了开集目标检测,具有良好的检测性能。

关键词: 深度学习, 水声图像, 开集目标检测, 特征融合, 多模态

Abstract: Because underwater sonar image data are scarce and supervisory information for underwater targets is limited, existing object detection algorithms are difficult to apply directly. To address this issue, this paper proposes USD (Underwater Sonar Detection), an open-set object detection method for underwater sonar images built on DETR (End-to-End Object Detection with Transformers). First, in the cross-modal feature fusion encoding module, a multi-scale deformable attention mechanism iterates over the image features on their own, letting the network selectively attend to important information while reducing computation; in parallel, a multi-head self-attention mechanism iterates over the text features, improving the model's global modeling capability for sequences. Then, a bidirectional attention mechanism fuses the text and image features, attending to relationships in both directions within the input sequences so that the network learns more complex text-image relationships. Finally, in the image-text feature decoding module, the image features output by the Encoder module are used to initialize the queries, and the DN (DeNoising) method is applied during training to address slow model convergence. Experiments show that the proposed method achieves a mean average precision of 77.5% on a self-built underwater sonar image dataset, a higher precision than the other detection methods compared, while also achieving open-set object detection with good detection performance.
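
The bidirectional text-image fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product cross-attention with no learned projections, a single head, and a residual sum, and the function names (`cross_attention`, `bidirectional_fusion`) and toy feature values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def cross_attention(queries, keys_values):
    """Each query row attends over all rows of keys_values.

    Assumption: keys and values are the same matrix and there are no
    learned projection weights, so this only shows the data flow.
    """
    d = len(queries[0])
    scores = matmul(queries, transpose(keys_values))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, keys_values), weights


def bidirectional_fusion(image_feats, text_feats):
    """Update image tokens with text context and text tokens with image context."""
    img_ctx, _ = cross_attention(image_feats, text_feats)  # image attends to text
    txt_ctx, _ = cross_attention(text_feats, image_feats)  # text attends to image
    fused_img = [[a + b for a, b in zip(r, c)] for r, c in zip(image_feats, img_ctx)]
    fused_txt = [[a + b for a, b in zip(r, c)] for r, c in zip(text_feats, txt_ctx)]
    return fused_img, fused_txt


# Toy example: 3 image tokens and 2 text tokens, each of dimension 2.
image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
fused_image, fused_text = bidirectional_fusion(image, text)
```

Both outputs keep the token count and feature dimension of their inputs, so the fused features can be passed to later layers unchanged in shape. A practical model would add learned query, key, and value projections, multiple heads, and normalization.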

Key words: Deep learning, Underwater sonar images, Open-set object detection, Feature fusion, Multimodal

CLC Number: 

  • TP399