Computer Science ›› 2025, Vol. 52 ›› Issue (11): 141-149. doi: 10.11896/jsjkx.240900113

• Computer Graphics & Multimedia •

Human-Object Interaction Detection Based on Fine-grained Attention Mechanism

DING Yuanbo, BAI Lin, LI Taoshen

  1. School of Computer and Electronic Information, Guangxi University, Nanning 530004, China
  • Received: 2024-09-18  Revised: 2024-12-02  Online: 2025-11-15  Published: 2025-11-06
  • Corresponding author: BAI Lin (bailin@gxu.edu.cn)
  • About the authors: DING Yuanbo, born in 2000, postgraduate (2603491489@qq.com). His main research interest is recognizing human-object interaction actions.
    BAI Lin, born in 1985, associate professor, postgraduate supervisor, is a member of CCF (No. A6951M). His main research interests include deep learning and computer vision.
  • Supported by: National Natural Science Foundation of China (61966003).


Abstract: Fine-grained information, as a kind of contextual information, can assist models in recognizing human-object interactions with similar relative spatial relationships. However, how to exploit this key cue to uniformly model feature information of different granularities on multi-scale feature maps remains a major challenge that hinders further improvement of human-object interaction detection accuracy. To address this problem, this paper proposes a human-object interaction detection model based on a fine-grained attention mechanism (FGDHOI). The model strengthens local features under the guidance of fine-grained information, fuses feature maps of different scales, automatically learns image content through a deformable attention mechanism, and models long-range dependencies between features of various granularities, fundamentally improving the accuracy of human-object interaction detection. Extensive qualitative, quantitative, and ablation experiments are conducted on the V-COCO and HICO datasets. The results show that, compared with the baseline models, the proposed method improves mAP by 7.7 percentage points on the V-COCO dataset, and by 7.43, 7.5, and 7.85 percentage points on the three metrics of the HICO dataset.
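The paper's implementation is not reproduced here. As a rough, illustrative sketch of the multi-scale deformable-attention idea the abstract describes (in the general style popularized by Deformable DETR, not the authors' code), the snippet below samples a few offset points around a shared reference location on each scale of a feature pyramid and combines the samples with learned attention weights. The single-channel feature maps, function names, and hand-supplied offsets/weights are all simplifying assumptions; in a real model a linear layer predicts the offsets and softmax-normalized weights from each query embedding.

```python
import math

def bilinear_sample(fmap, x, y):
    """Bilinearly interpolate a 2D single-channel feature map (list of rows)
    at continuous coordinates (x, y); out-of-bounds taps contribute zero."""
    h, w = len(fmap), len(fmap[0])
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    out = 0.0
    for xi, yi in [(x0, y0), (x0 + 1, y0), (x0, y0 + 1), (x0 + 1, y0 + 1)]:
        if 0 <= yi < h and 0 <= xi < w:
            out += (1 - abs(x - xi)) * (1 - abs(y - yi)) * fmap[yi][xi]
    return out

def deformable_attention(feature_maps, ref_point, offsets, weights):
    """One query attending across a feature pyramid.

    feature_maps  -- one 2D map per scale
    ref_point     -- (x, y) in normalized [0, 1] coordinates, shared by all scales
    offsets[s][k] -- (dx, dy) sampling offset, in pixels of scale s, for point k
    weights[s][k] -- attention weight for that sample (softmax-normalized to
                     sum to 1 over all scales and points in the real model)
    """
    rx, ry = ref_point
    out = 0.0
    for s, fmap in enumerate(feature_maps):
        h, w = len(fmap), len(fmap[0])
        for (dx, dy), wgt in zip(offsets[s], weights[s]):
            # Map the normalized reference point into this scale, then shift it.
            out += wgt * bilinear_sample(fmap, rx * (w - 1) + dx, ry * (h - 1) + dy)
    return out

# Example: two scales; all weight on one sample at the center of the 2x2 map.
fm_coarse = [[0.0, 1.0], [2.0, 3.0]]
fm_fine = [[float(i + j) for j in range(4)] for i in range(4)]
result = deformable_attention(
    [fm_coarse, fm_fine], (0.5, 0.5),
    offsets=[[(0.0, 0.0)], [(0.0, 0.0)]],
    weights=[[1.0], [0.0]],
)
print(result)  # → 1.5 (bilinear average of 0, 1, 2, 3)
```

Because each query only touches a small fixed number of sampled points per scale, this attends over all pyramid levels at once without the quadratic cost of dense attention, which is what makes uniform modeling of coarse and fine granularities tractable.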

Key words: Deep learning, Human-object interaction detection, Fine-grained information, Attention mechanism

CLC number: TP391

References:
[1]GUPTA S,MALIK J.Visual semantic role labeling[J].arXiv:1505.04474,2015.
[2]SADEGHI M A,FARHADI A.Recognition using visual phrases[C]//2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.IEEE,2011:1745-1752.
[3]WAN B,ZHOU D,LIU Y,et al.Pose-aware multi-level feature network for human object interaction detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2019:9469-9478.
[4]LI Y L,ZHOU S,HUANG X,et al.Transferable interactiveness knowledge for human-object interaction detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:3585-3594.
[5]YAN Z X,BAI L,LI T S.Lightweight human pose estimation based on self-knowledge distillation and convolution compression[J].Journal of Chinese Computer Systems,2024,45(2):461-469.
[6]DALAL N,TRIGGS B.Histograms of oriented gradients for human detection[C]//Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.Washington:IEEE Computer Society,2005:886-893.
[7]LOWE D G.Distinctive image features from scale-invariant keypoints[J].International Journal of Computer Vision,2004,60(2):91-110.
[8]GAO C,ZOU Y,HUANG J B.iCAN:Instance-centric attention network for human-object interaction detection[J].arXiv:1808.10437,2018.
[9]GKIOXARI G,GIRSHICK R,DOLLÁR P,et al.Detecting and recognizing human-object interactions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:8359-8367.
[10]REN S,HE K,GIRSHICK R,et al.Faster r-cnn:Towards real-time object detection with region proposal networks[C]//Advances in Neural Information Processing Systems.2015.
[11]LI B Z,ZHANG J,WANG B L,et al.Human-Object Interaction Recognition Integrating Multi-level Visual Features[J].Computer Science,2022,49(S2):643-650.
[12]LIN X,ZOU Q,XU X.Action-guided attention mining and relation reasoning network for human-object interaction detection[C]//Proceedings of the 29th International Conference on International Joint Conferences on Artificial Intelligence.2021:1104-1110.
[13]SUN X,HU X,REN T,et al.Human object interaction detection via multi-level conditioned network[C]//Proceedings of the 2020 International Conference on Multimedia Retrieval.2020:26-34.
[14]QI S,WANG W,JIA B,et al.Learning human-object interactions by graph parsing neural networks[C]//Proceedings of the European Conference on Computer Vision(ECCV).2018:401-417.
[15]WANG H,ZHENG W,YINGBIAO L.Contextual heterogeneous graph network for human-object interaction detection[C]//Computer Vision-ECCV 2020:16th European Conference.Cham:Springer,2020:248-264.
[16]ULUTAN O,IFTEKHAR A S M,MANJUNATH B S.Vsg-net:Spatial attention network for detecting human object interactions using graph convolutions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:13617-13626.
[17]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Advances in Neural Information Processing Systems.2017.
[18]ZOU C,WANG B,HU Y,et al.End-to-end human object interaction detection with HOI transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:11825-11834.
[19]TAMURA M,OHASHI H,YOSHINAGA T.QPIC:Query-based pairwise human-object interaction detection with image-wide contextual information[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:10410-10419.
[20]ZHOU P,CHI M.Relation parsing neural network for human-object interaction detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2019:843-851.
[21]LIU H,MU T J,HUANG X.Detecting human-object interaction with multi-level pairwise feature network[J].Computational Visual Media,2021,7:229-239.
[22]CHEN M,LIAO Y,LIU S,et al.Reformulating hoi detection as adaptive set prediction[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:9004-9013.
[23]KIM B,LEE J,KANG J,et al.Hotr:End-to-end human-object interaction detection with transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:74-83.
[24]PARK J,PARK J W,LEE J S.ViPLO:Vision transformer based pose-conditioned self-loop graph for human-object interaction detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2023:17152-17162.
[25]MA S,WANG Y,WANG S,et al.FGAHOI:Fine-grained anchors for human-object interaction detection[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2024,46(4):2415-2429.
[26]WU M,GU J,SHEN Y,et al.End-to-end zero-shot HOI detection via vision and language knowledge distillation[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2023:2839-2846.
[27]RADFORD A,KIM J W,HALLACY C,et al.Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning.PMLR,2021:8748-8763.