Computer Science ›› 2026, Vol. 53 ›› Issue (1): 141-152. doi: 10.11896/jsjkx.250100086

• Computer Graphics & Multimedia •


PKHOI: Enhancing Human-Object Interaction Detection Algorithms with Prior Knowledge

ZHAO Wenhao, MEI Meng, WANG Xiaoping, LUO Hangyu   

  1. College of Computer Science and Technology, Tongji University, Shanghai 200092, China
  • Received:2025-01-14 Revised:2025-03-30 Online:2026-01-08
  • Corresponding author: WANG Xiaoping (xpwang6510@tongji.edu.cn)
  • About author: ZHAO Wenhao, born in 2001, postgraduate (zwh1625@tongji.edu.cn). His main research interests include computer vision and HOI detection.
    WANG Xiaoping, born in 1965, Ph.D., professor. His main research interests include AI algorithms, deep learning and computer vision.
  • Supported by:
    National Key Research and Development Program of China (2022YFB4300504-4) and Special Fund Project of the Shanghai Municipal Commission of Economy and Information Technology (202201034).


Abstract: Human-object interaction (HOI) detection plays a crucial role in visual scene understanding. With the advancement of deep learning technologies, vision-based interaction detection models have achieved promising performance. However, most existing methods make no use of prior logical knowledge and can therefore produce implausible predictions. Additionally, while some methods employ spatial and human pose information for reasoning, they construct losses only between inference results and annotations, which prevents the decoder from learning accurate implicit relationships. This paper therefore proposes PKHOI, a method that leverages prior knowledge to effectively improve the accuracy of existing HOI detection algorithms. Specifically, it constructs a logical rule table from the training set, covering object functionality, spatial relationships, human pose, and verb co-occurrence; these rules are formalized as first-order logic and mapped into continuous space. The prior logical rules are then incorporated into the neural network as a loss function during training and as a matrix multiplication during inference, enhancing model accuracy. Furthermore, this paper proposes a method that generates human-object pair queries by fusing multimodal information (spatial, semantic, and human pose information); combined with the logical loss function, this guides the decoder to learn more implicit knowledge. The proposed method is used to enhance two mainstream HOI detection algorithms, UPT and PViC, and is evaluated on the V-COCO, HICO-DET, and Flickr30k datasets. Experimental results demonstrate that it effectively improves the performance of existing approaches.
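The mechanism described above can be made concrete with a short sketch. The following PyTorch code is a minimal illustration, not the authors' implementation: the rule table rules, the functions logic_loss and apply_rules_at_inference, and all dimensions are hypothetical stand-ins for the paper's mined rule table, training-time logic loss, and inference-time matrix multiplication.

    import torch

    # Hypothetical rule table mined from the training set:
    # rules[o, v] = 1 if verb v is compatible with object class o, else 0.
    num_objects, num_verbs = 80, 117
    rules = torch.randint(0, 2, (num_objects, num_verbs)).float()

    def logic_loss(verb_probs, obj_labels, rules):
        # Relax the first-order rule "object(o) -> NOT verb(v)" (for
        # incompatible object-verb pairs) with a product t-norm: the
        # rule's truth value is 1 - p(v), so -log(1 - p(v)) penalizes
        # confident violations while leaving permitted verbs untouched.
        compat = rules[obj_labels]               # (B, num_verbs), 1 = allowed
        violation = verb_probs * (1.0 - compat)  # prob. mass on forbidden verbs
        return -torch.log(1.0 - violation + 1e-6).sum(dim=-1).mean()

    def apply_rules_at_inference(verb_probs, obj_labels, rules, alpha=0.5):
        # Fold the same rule table into the verb scores by matrix
        # arithmetic; alpha interpolates between the raw prediction
        # (alpha = 0) and hard masking of forbidden verbs (alpha = 1).
        compat = rules[obj_labels]
        return verb_probs * ((1.0 - alpha) + alpha * compat)

    # Dummy usage for a batch of 4 human-object pairs:
    verb_probs = torch.sigmoid(torch.randn(4, num_verbs))
    obj_labels = torch.tensor([0, 3, 3, 41])
    loss = logic_loss(verb_probs, obj_labels, rules)
    scores = apply_rules_at_inference(verb_probs, obj_labels, rules)

The multimodal query generation could likewise be sketched as a small fusion module; again, the dimensions (36-d pairwise spatial features, 512-d semantic embeddings, 17 x 2 pose keypoints) and the project-then-sum design are illustrative assumptions, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class PairQueryFusion(nn.Module):
        # Fuse the spatial, semantic, and pose features of one
        # human-object pair into a single decoder query of width d_model.
        def __init__(self, d_spatial=36, d_semantic=512, d_pose=34, d_model=256):
            super().__init__()
            self.proj_spatial = nn.Linear(d_spatial, d_model)
            self.proj_semantic = nn.Linear(d_semantic, d_model)
            self.proj_pose = nn.Linear(d_pose, d_model)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, spatial, semantic, pose):
            q = (self.proj_spatial(spatial)
                 + self.proj_semantic(semantic)
                 + self.proj_pose(pose))
            return self.norm(q)  # (num_pairs, d_model) decoder queries

    fusion = PairQueryFusion()
    queries = fusion(torch.randn(8, 36), torch.randn(8, 512), torch.randn(8, 34))

Keeping the fused query at the decoder's native width is what would let such a module drop into existing two-stage detectors such as UPT or PViC without changing their decoder architecture.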

Key words: Human-object interaction detection, Prior knowledge, First-order logic, Pose information, Multi-modal information fusion

CLC number: 

  • TP391
[1]FANG H S,XIE Y C,SHAO D,et al.DIRV:Dense interaction region voting for end-to-end human-object interaction detection[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2021:1291-1299.
[2]KIM B,CHOI T,KANG J,et al.UnionDet:Union-level detector towards real-time human-object interaction detection[C]//Computer Vision-ECCV 2020:16th European Conference.2020:498-514.
[3]LIAO Y,LIU S,WANG F,et al.PPDM:Parallel point detection and matching for real-time human-object interaction detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.New York:IEEE,2020:482-490.
[4]CHEN J,YANAI K.QAHOI:Query-based anchors for human-object interaction detection[C]//2023 18th International Conference on Machine Vision and Applications(MVA).New York:IEEE,2023:1-5.
[5]MA S,WANG Y,WANG S,et al.FGAHOI:Fine-grained anchors for human-object interaction detection[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2024,46(4):2415-2429.
[6]LIM J Y,BASKARAN V M,LIM J M Y,et al.ERNet:An efficient and reliable human-object interaction detection network[J].IEEE Transactions on Image Processing,2023,32:964-979.
[7]ZHANG A,LIAO Y,LIU S,et al.Mining the benefits of two-stage and one-stage HOI detection[J].Advances in Neural Information Processing Systems,2021,34:17209-17220.
[8]CARION N,MASSA F,SYNNAEVE G,et al.End-to-end object detection with transformers[C]//Computer Vision-ECCV 2020:16th European Conference.Berlin:Springer,2020:213-229.
[9]GIRSHICK R.Fast R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision.New York:IEEE,2015:1440-1448.
[10]BANSAL A,RAMBHATLA S S,SHRIVASTAVA A,et al.Detecting human-object interactions via functional generalization[C]//Proceedings of the AAAI Conference on Artificial Intelligence.New York:AAAI,2020:10460-10469.
[11]LI Y L,LIU X P,LU H,et al.Detailed 2D-3D joint representation for human-object interaction[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.New York:IEEE,2020:10166-10175.
[12]GUPTA T,SCHWING A,HOIEM D.No-frills human-object interaction detection:Factorization,layout encodings,and training techniques[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.New York:IEEE,2019:9677-9685.
[13]WU E Z Y,LI Y,WANG Y,et al.Exploring pose-aware human-object interaction via hybrid learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.New York:IEEE,2024:17815-17825.
[14]PARK J,PARK J W,LEE J S.VIPLO:Vision transformer based pose-conditioned self-loop graph for human-object interaction detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.New York:IEEE,2023:17152-17162.
[15]ZHANG F Z,CAMPBELL D,GOULD S.Spatially conditioned graphs for detecting human-object interactions[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.New York:IEEE,2021:13319-13327.
[16]WANG T,ANWER R M,KHAN M H,et al.Deep contextual attention for human-object interaction detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.New York:IEEE,2019:5694-5702.
[17]ZHANG F Z,CAMPBELL D,GOULD S.Efficient two-stage detection of human-object interactions with a novel unary-pairwise transformer[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR).IEEE,2022:20072-20080.
[18]ZHANG F Z,YUAN Y,CAMPBELL D,et al.Exploring predicate visual context in detecting human-object interactions[C]//2023 IEEE/CVF International Conference on Computer Vision(ICCV).IEEE,2023:10377-10387.
[19]YUAN H,JIANG J,ALBANIE S,et al.RLIP:Relational language-image pre-training for human-object interaction detection[J].Advances in Neural Information Processing Systems,2022,35:37416-37431.
[20]YUAN H,ZHANG S W,WANG X,et al.RLIPv2:Fast scaling of relational language-image pre-training[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.New York:IEEE,2023:21649-21661.
[21]NING S,QIU L,LIU Y,et al.HOICLIP:Efficient knowledge transfer for HOI detection with vision-language models[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR).IEEE,2023:23507-23517.
[22]CHAO Y W,LIU Y,LIU X,et al.Learning to detect human-object interactions[C]//2018 IEEE Winter Conference on Applications of Computer Vision(WACV).New York:IEEE,2018:381-389.
[23]DILIGENTI M,ROYCHOWDHURY S,GORI M.Integrating prior knowledge into deep learning[C]//2017 16th IEEE International Conference on Machine Learning and Applications(ICMLA).New York:IEEE,2017:920-923.
[24]CHEN S,LENG Y,LABI S.A deep learning algorithm for simulating autonomous driving considering prior knowledge and temporal information[J].Computer-Aided Civil and Infrastructure Engineering,2020,35(4):305-321.
[25]DING X,LUO Y,LI Q,et al.Prior knowledge-based deep learning method for indoor object recognition and application[J].Systems Science & Control Engineering,2018,6(1):249-257.
[26]ZHENG S,MAI S,SUN Y,et al.Subgraph-aware few-shot inductive link prediction via meta-learning[J].IEEE Transactions on Knowledge and Data Engineering,2022,35(6):6512-6517.
[27]GENG Y,CHEN J,PAN J Z,et al.Relational message passing for fully inductive knowledge graph completion[C]//2023 IEEE 39th International Conference on Data Engineering(ICDE).New York:IEEE,2023:1221-1233.
[28]ZHANG Y,YAO Q.Knowledge graph reasoning with relational digraph[C]//Proceedings of the ACM Web Conference 2022.New York:ACM,2022:912-924.
[29]CHEN D,LAI H,GAO G,et al.Prior knowledge guided three-branch transformer for HOI detection[EB/OL].https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4608308.
[30]LIAO Y,ZHANG A,LU M,et al.Gen-VLKT:Simplify association and enhance interaction understanding for HOI detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.New York:IEEE,2022:20123-20132.
[31]GAO J,LIANG K,WEI T,et al.Dual-prior augmented decoding network for long tail distribution in HOI detection[C]//Proceedings of the AAAI Conference on Artificial Intelligence.New York:AAAI,2024:1806-1814.
[32]RADFORD A,KIM J W,HALLACY C,et al.Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning.PMLR,2021:8748-8763.
[33]GUPTA S,MALIK J.Visual semantic role labeling[J].arXiv:1505.04474,2015.
[34]ZHANG F Z,CAMPBELL D,GOULD S.Efficient two-stage detection of human-object interactions with a novel unary-pairwise transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.New York:IEEE,2022:20104-20112.
[35]XU Y,ZHANG J,ZHANG Q,et al.ViTPose:Simple vision transformer baselines for human pose estimation[J].Advances in Neural Information Processing Systems,2022,35:38571-38584.
[36]BA J L,KIROS J R,HINTON G E.Layer normalization[J].arXiv:1607.06450,2016.
[37]KIM D J,SUN X,CHOI J,et al.Detecting human-object interactions with action co-occurrence priors[C]//Computer Vision-ECCV 2020:16th European Conference.Berlin:Springer,2020:718-736.
[38]VAN KRIEKEN E,ACAR E,VAN HARMELEN F.Analyzing differentiable fuzzy logic operators[J].Artificial Intelligence,2022,302:103602.
[39]SERAFINI L,D’AVILA GARCEZ A,BADREDDINE S,et al.Logic tensor networks:Theory and applications[M]//Neuro-Symbolic Artificial Intelligence:The State of the Art.Amsterdam:IOS,2021:370-394.
[40]YOUNG P,LAI A,HODOSH M,et al.From image descriptions to visual denotations:New similarity metrics for semantic inference over event descriptions[J].Transactions of the Association for Computational Linguistics,2014,2:67-78.
[41]TAMURA M,OHASHI H,YOSHINAGA T.QPIC:Query-based pairwise human-object interaction detection with image-wide contextual information[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.New York:IEEE,2021:10410-10419.
[42]ZHANG Y,PAN Y,YAO T,et al.Exploring structure-aware transformer over interaction proposals for human-object interaction detection[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR).IEEE,2022:19526-19535.
[43]YANG J,LI B,YANG F,et al.Boosting human-object interaction detection with text-to-image diffusion model[J].arXiv:2305.12252,2023.
[44]ZHANG F Z,YUAN Y,CAMPBELL D,et al.Exploring predicate visual context in detecting human-object interactions[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.New York:IEEE,2023:10411-10421.