Computer Science, 2026, Vol. 53, Issue (1): 141-152. doi: 10.11896/jsjkx.250100086

• Computer Graphics & Multimedia •

PKHOI: Enhancing Human-Object Interaction Detection Algorithms with Prior Knowledge

ZHAO Wenhao, MEI Meng, WANG Xiaoping, LUO Hangyu   

  1. College of Computer Science and Technology, Tongji University, Shanghai 200092, China
  • Received: 2025-01-14 Revised: 2025-03-30 Published: 2026-01-08
  • About author: ZHAO Wenhao, born in 2001, postgraduate. His main research interests include computer vision and HOI detection.
    WANG Xiaoping, born in 1965, Ph.D, professor. His main research interests include AI algorithms, deep learning and computer vision.
  • Supported by:
    National Key Research and Development Program of China (2022YFB4300504-4) and Special Fund Project supported by Shanghai Municipal Commission of Economy and Information Technology (202201034).

Abstract: Human-object interaction (HOI) detection plays a crucial role in visual scene understanding. With the advancement of deep learning, vision-based interaction detection models have achieved promising performance. However, most existing methods make little use of prior logical knowledge, which sometimes leads to unreasonable predictions. In addition, although some methods use spatial and human pose information for reasoning, they only construct losses between inference results and annotations, which prevents the decoder from learning accurate implicit relationships. This paper therefore proposes PKHOI, a method that enhances existing HOI detection algorithms with prior knowledge and improves their accuracy. Specifically, it constructs a logical rule table from the training set that covers object functionality, spatial relationships, human poses, and verb co-occurrence. These rules are transformed into first-order logic and mapped to a continuous space. The prior logical rules are then incorporated into the neural network through loss functions during training and through matrix multiplication during inference. Furthermore, this paper proposes a method that generates human-object pair queries by fusing multimodal information (spatial, semantic, and human pose information). Combined with the logical loss functions, this approach guides the decoder to learn more implicit knowledge. The proposed method is applied to two mainstream HOI detection algorithms, UPT and PViC, and evaluated on the V-COCO, HICO-DET, and Flickr30k datasets. Experimental results show that it improves the performance of both baseline approaches.
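The two integration points named in the abstract (a logic loss during training and a matrix multiplication during inference) can be illustrated with a small sketch. The abstract does not say which fuzzy operators or matrix layout PKHOI uses, so everything below is an assumption made for illustration: the Reichenbach implication as the continuous relaxation, the negative log truth value as the loss, and the hypothetical names `implies`, `logic_loss`, `apply_prior`, and `compat`.

```python
import math

def implies(a: float, b: float) -> float:
    """Reichenbach fuzzy implication (assumed operator): truth of a -> b in [0, 1]."""
    return 1.0 - a + a * b

def logic_loss(verb_probs, rules, eps=1e-7):
    """Mean negative log truth value over rules (premise_idx, conclusion_idx).

    Each rule reads "if verb `premise` holds, verb `conclusion` should hold too",
    e.g. a verb co-occurrence rule mined from the training set.
    """
    total = 0.0
    for premise, conclusion in rules:
        truth = implies(verb_probs[premise], verb_probs[conclusion])
        total += -math.log(max(truth, eps))  # clamp to avoid log(0)
    return total / len(rules)

def apply_prior(object_probs, compat, verb_probs):
    """Inference-time reweighting with a prior matrix (assumed layout).

    compat[o][v] is 1.0 if verb v is plausible for object class o, else 0.0.
    The product object_probs x compat gives a per-verb plausibility that
    scales the predicted verb scores elementwise.
    """
    n_objects, n_verbs = len(object_probs), len(verb_probs)
    plausibility = [
        sum(object_probs[o] * compat[o][v] for o in range(n_objects))
        for v in range(n_verbs)
    ]
    return [p * w for p, w in zip(verb_probs, plausibility)]

# Toy setup: verbs = [ride, sit_on, eat], objects = [bicycle, apple].
rules = [(0, 1)]                      # ride -> sit_on
compat = [[1.0, 1.0, 0.0],            # bicycle: ride, sit_on
          [0.0, 0.0, 1.0]]            # apple: eat

consistent = logic_loss([0.9, 0.9, 0.1], rules)
inconsistent = logic_loss([0.9, 0.1, 0.1], rules)
reweighted = apply_prior([1.0, 0.0], compat, [0.6, 0.5, 0.4])
```

In this toy setup the prediction that violates the rule (high "ride", low "sit_on") receives the larger loss, and the implausible "eat a bicycle" score is suppressed to zero at inference. A real system would operate on batched tensors and would likely use soft rather than binary compatibility weights.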

Key words: Human-object interaction detection, Prior knowledge, First-order logic, Pose information, Multi-modal information fusion

CLC Number: 

  • TP391
[1] ZENG Dan, HE Xingxing, LI Yingfang, LI Tianrui. Structures of Multi-line Standard Contradictions in First-order Logic [J]. Computer Science, 2025, 52(12): 200-208.
[2] DING Yuanbo, BAI Lin, LI Taoshen. Human-Object Interaction Detection Based on Fine-grained Attention Mechanism [J]. Computer Science, 2025, 52(11): 141-149.
[3] YIN Baosheng, ZHOU Peng. Chinese Medical Named Entity Recognition with Label Knowledge [J]. Computer Science, 2024, 51(6A): 230500203-7.
[4] DONG Zhen-heng, REN Wei-ping, YOU Xin-dong, LYU Xue-qiang. Machine Translation Method Integrating New Energy Terminology Knowledge [J]. Computer Science, 2022, 49(6): 305-312.
[5] LIU Xin, YUAN Jia-bin, WANG Tian-xing. Interior Human Action Recognition Method Based on Prior Knowledge of Scene [J]. Computer Science, 2022, 49(1): 225-232.
[6] HAO Jiao, HUI Xiao-jing, MA Shuo, JIN Ming-hui. Study on Axiomatic Truth Degree in First-order Logic [J]. Computer Science, 2021, 48(11A): 669-671.
[7] CAO Feng, XU Yang, ZHONG Jian, NING Xin-ran. First-order Logic Clause Set Preprocessing Method Based on Goal Deduction Distance [J]. Computer Science, 2020, 47(3): 217-221.
[8] TIAN Zhen-kun, FU Ying-ying, LIU Su-hong. Remote Sensing Image Classification Based on Heterogeneous Machine Learning Algorithm Fusion [J]. Computer Science, 2019, 46(5): 235-240.
[9] ZHAO Jia-min, FENG Ai-min, CHEN Song-can, PAN Zhi-song. Maximum Constrained Density One-class Classifier [J]. Computer Science, 2014, 41(2): 59-63.
[10] YU Xu, YANG Jing, XIE Zhi-qiang. Research on Virtual Sample Generation Technology [J]. Computer Science, 2011, 38(3): 16-19.
[11] LI Lin-na, CHEN Hai-rui, WANG Ying-long. Semi-supervised Clustering of Complex Structured Data Based on Higher-order Logic [J]. Computer Science, 2009, 36(9): 196-200.