Human-Object Interaction Detection Based on Fine-grained Attention Mechanism

doi:10.11896/jsjkx.240900113

Abstract

Fine-grained information, as a kind of contextual information, can help models recognize human-object interactions that have similar relative spatial relationships. However, how to use this key cue to uniformly model feature information of different granularities on multi-scale feature maps remains a critical challenge that hinders further improvement of human-object interaction detection accuracy. To address this problem, this paper proposes a human-object interaction detection model based on a fine-grained attention mechanism. The model strengthens local features under the guidance of fine-grained information. It fuses feature maps of different scales and automatically learns image content through a deformable attention mechanism. Additionally, it models the long-range dependencies between features of various granularities, thereby improving the accuracy of human-object interaction detection. Extensive experiments are conducted on the V-COCO and HICO datasets. The experimental results show that the proposed method increases the mAP by 7.7 percentage points on the V-COCO dataset, and by 7.43, 7.5 and 7.85 percentage points on the HICO dataset, compared to the baseline models.

Key words: Deep learning, Human-Object interaction detection, Fine-grained information, Attention mechanism

CLC Number: TP391

DING Yuanbo, BAI Lin, LI Taoshen. Human-Object Interaction Detection Based on Fine-grained Attention Mechanism [J]. Computer Science, 2025, 52(11): 141-149.
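
The abstract states that the model fuses feature maps of different scales through a deformable attention mechanism, but it does not give the architecture. The following is a minimal, generic sketch of the idea of multi-scale deformable sampling for a single query, and is not the authors' implementation. The function names, the single-channel feature maps, and the hand-supplied `offsets` and `logits` are all assumptions made for illustration; in a trained model these would typically be predicted from the query by learned layers.

```python
import math


def bilinear_sample(fmap, x, y):
    """Sample a 2D map (list of rows) at fractional pixel (x, y), clamped to the border."""
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_attention(fmaps, ref_point, offsets, logits):
    """Aggregate a few sampled values per scale around one reference point.

    fmaps:     list of single-channel 2D maps, one per scale (hypothetical input)
    ref_point: (x, y) normalized to [0, 1], shared by all scales
    offsets:   per scale, a list of (dx, dy) pixel offsets (assumed given here)
    logits:    per scale, one attention score per sampling point (assumed given)
    """
    # Attention weights are normalized jointly over all scales and points.
    weights = softmax([s for level in logits for s in level])
    out, k = 0.0, 0
    for fmap, level_offsets in zip(fmaps, offsets):
        h, w = len(fmap), len(fmap[0])
        # Map the normalized reference point into this scale's pixel grid.
        cx, cy = ref_point[0] * (w - 1), ref_point[1] * (h - 1)
        for dx, dy in level_offsets:
            out += weights[k] * bilinear_sample(fmap, cx + dx, cy + dy)
            k += 1
    return out
```

Because each query reads only a handful of sampled locations per scale rather than every pixel, coarse and fine feature maps can be combined at a cost that depends on the number of sampling points, not on the map resolution.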

