Computer Science, 2026, Vol. 53, Issue (6A): 250700173-8. doi: 10.11896/jsjkx.250700173

• Image Processing & Multimedia Technology •

RGB-IR Multi-modal Fusion-based Tomato Small Object Detection

DONG Ye1, LIAN Xinyue1, WANG Yuyang1, OU Xinyu1,2   

  1 School of Information, Yunnan University of Finance and Economics, Kunming 650600, China
  2 Yunnan Key Laboratory of Service Computing, Kunming 650221, China
  • Online: 2026-06-16 Published: 2026-06-12
  • About author: DONG Ye, born in 2004, undergraduate. Her main research interests include computer vision, image recognition, and deep learning algorithms for practical applications.
    OU Xinyu, born in 1982, Ph.D., professor. His main research interests include computer vision, multi-modality, and federated learning.
  • Supported by:
    National Natural Science Foundation of China (62462065), Ministry of Education Humanities and Social Sciences Research Project (25YJCZH220), and Yunnan University of Finance and Economics Yuncai Scholar Project (2024D45).

Abstract: Automated tomato harvesting plays a pivotal role in improving agricultural efficiency and ensuring produce quality, but it faces significant challenges in complex orchard environments, where single-modal systems often fall short. Variability in lighting, occlusion by foliage, and the subtle appearance of small targets in RGB images, coupled with inadequate feature extraction from single-sensor data, significantly impede precise detection. This study introduces the robust tomato detection with multi-modal fusion (RTDMF) model, which addresses these limitations by integrating RGB and infrared (IR) imaging to strengthen detection robustness. Built on the YOLOv5 framework, RTDMF incorporates lightweight depthwise separable convolutions and adaptive anchor boxes to increase sensitivity to small targets. Its dual-branch architecture processes RGB and IR data independently and fuses color, texture, and thermal features through self-attention mechanisms and specialized fusion modules. In addition, Mosaic data augmentation and a dynamic learning rate strategy are employed to improve the model's generalization and convergence. Evaluated on a multimodal tomato dataset covering varying levels of maturity, diverse lighting conditions, and occlusion scenarios, RTDMF achieves a 9.7% increase in mean average precision (mAP) and a 0.6% higher recall rate than single-modal models. It also reduces the miss and false detection rates by 2.3% and 3.1%, respectively. Visual analysis confirms the model's effectiveness in low-contrast and heavily occluded scenes, showing its adaptability to real-world agricultural conditions. This multi-modal approach offers a robust solution for automated harvesting systems in dynamic environments, marking a significant advancement in agricultural automation.

Key words: Tomato small object detection, Multi-modal fusion, Adaptive anchor box optimization, Dual-branch architecture, Target feature refinement

CLC Number: 

  • TP391
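
The abstract mentions adaptive anchor boxes but does not say how they are computed. One common approach, popularized by YOLOv2, is to cluster the labelled box sizes with k-means using 1 - IoU as the distance. The sketch below illustrates that idea only; it is an assumption, not the RTDMF procedure, and the function names and sample box sizes are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float]  # (width, height) in pixels


def wh_iou(a: Box, b: Box) -> float:
    """IoU of two boxes aligned at the same top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union


def fit_anchors(boxes: List[Box], k: int, iters: int = 50) -> List[Box]:
    """Cluster (w, h) pairs with k-means, using 1 - IoU as the distance."""
    if not 0 < k <= len(boxes):
        raise ValueError("k must be between 1 and the number of boxes")
    # Deterministic start: pick centres spread across the area-sorted boxes.
    by_area = sorted(boxes, key=lambda b: b[0] * b[1])
    step = len(by_area) / k
    centres = [by_area[int((i + 0.5) * step)] for i in range(k)]
    for _ in range(iters):
        clusters: List[List[Box]] = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as smallest 1 - IoU distance.
            best = max(range(k), key=lambda i: wh_iou(box, centres[i]))
            clusters[best].append(box)
        updated = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centres[i]  # keep the old centre if a cluster is empty
            for i, c in enumerate(clusters)
        ]
        if updated == centres:  # assignments have stopped changing
            break
        centres = updated
    return sorted(centres, key=lambda b: b[0] * b[1])


# Hypothetical box sizes (pixels) standing in for small, medium, and large fruit.
example_boxes = [(8, 9), (10, 10), (9, 11), (20, 22), (22, 20),
                 (24, 25), (48, 50), (52, 47), (50, 55)]
anchors = fit_anchors(example_boxes, k=3)
print(anchors)
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, which matters when most targets are small.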