Computer Science, 2025, Vol. 52, Issue (11): 141-149. doi: 10.11896/jsjkx.240900113

• Computer Graphics & Multimedia •

Human-Object Interaction Detection Based on Fine-grained Attention Mechanism

DING Yuanbo, BAI Lin, LI Taoshen   

  1. School of Computer and Electronic Information, Guangxi University, Nanning 530004, China
  • Received: 2024-09-18 Revised: 2024-12-02 Online: 2025-11-15 Published: 2025-11-06
  • About author: DING Yuanbo, born in 2000, postgraduate. His main research interest is recognizing human-object interaction actions.
    BAI Lin, born in 1985, associate professor, postgraduate supervisor, is a member of CCF (No. A6951M). His main research interests include deep learning and computer vision.
  • Supported by:
    National Natural Science Foundation of China (61966003).

Abstract: Fine-grained information, as a kind of contextual information, can help models distinguish human-object interactions that have similar relative spatial relationships. However, how to use this key cue to uniformly model feature information of different granularities on multi-scale feature maps remains a critical challenge that hinders further improvement of human-object interaction detection accuracy. To address this problem, this paper proposes a human-object interaction detection model based on a fine-grained attention mechanism. The model strengthens local features under the guidance of fine-grained information. It fuses feature maps of different scales and automatically learns image content through a deformable attention mechanism. Additionally, it models the long-range dependencies between features of various granularities, thereby improving the accuracy of human-object interaction detection. Extensive experiments are conducted on the V-COCO and HICO datasets. The experimental results show that, compared with the baseline models, the proposed method increases the mAP by 7.7 percentage points on the V-COCO dataset and by 7.43, 7.5 and 7.85 percentage points on the HICO dataset.

Key words: Deep learning, Human-Object interaction detection, Fine-grained information, Attention mechanism

CLC Number: 

  • TP391