Human-Object Interaction Recognition Integrating Multi-level Visual Features

doi:10.11896/jsjkx.220700012

Abstract

Abstract: Computer-vision-based human action recognition has broad application prospects in video surveillance, intelligent driving, human-computer interaction, multimedia content auditing, and other fields, and human-object interaction is one of its core components. Most existing human-object interaction recognition models, which are based on multi-stream convolutional neural networks, capture only surface-level visual features; they fail to fully exploit the key regions of the human body or the deep semantic relationships between humans and objects. To address this problem, this paper proposes a hierarchical graph neural network (HGNN) model for human-object interaction. HGNN explicitly models, from local to global, the key regions of the human body and the scene graph formed by humans and objects, and uses an attention graph pooling mechanism (AttPool) to remove redundant information and noise from the hierarchical graph. A graph convolutional network then captures the deep semantic relationships between graph nodes, aggregating and refining the initial features extracted by the convolutional neural network to obtain a feature representation that reflects the essential character of the interaction. In addition, interim supervised classification on the middle-level graph constrains the model to better learn the human-body patterns of interactive actions and keeps it from becoming biased toward the interacting objects. Finally, a multi-task loss function is designed to train HGNN effectively. To verify the effectiveness of HGNN, extensive experiments are conducted on the large public benchmark V-COCO. The results show that HGNN is adaptive and robust across common human-object interactions: in mean average precision (mAP) it outperforms previous graph-neural-network-based methods by a large margin and also performs better than most of the latest multi-stream convolutional models.

Key words: Computer vision, Human action recognition, Human-object interaction, Deep learning, Graph neural network
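
The abstract describes a two-step refinement: attention-based pooling first discards uninformative graph nodes, and graph convolution then aggregates features over the remaining ones. The following is a minimal sketch of that idea only, not the authors' HGNN implementation. It assumes a simplified top-k attention pooling (rather than the exact AttPool formulation) and the standard symmetric-normalized graph convolution; the function names, the toy 4-node graph, the attention vector, and the identity weight matrix are all hypothetical.

```python
import math

def matmul(a, b):
    """Plain matrix product for small nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2, the symmetric normalization used by GCN."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcn_layer(adj, feats, weight):
    """One graph convolution step: ReLU(A_norm @ X @ W)."""
    out = matmul(matmul(normalize_adjacency(adj), feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

def attention_pool(adj, feats, att_vector, k):
    """Simplified top-k attention pooling (an assumption, not the exact AttPool).

    Scores each node by a dot product with att_vector, applies a softmax,
    keeps the k highest-scoring nodes, and scales their features by the score.
    """
    scores = [sum(f * a for f, a in zip(row, att_vector)) for row in feats]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    att = [e / total for e in exps]
    ranked = sorted(range(len(att)), key=lambda i: att[i], reverse=True)
    keep = sorted(ranked[:k])
    pooled_feats = [[value * att[i] for value in feats[i]] for i in keep]
    pooled_adj = [[adj[i][j] for j in keep] for i in keep]
    return keep, pooled_adj, pooled_feats

# Hypothetical toy graph: node 0 = human, node 1 = object, nodes 2-3 = weak regions.
adj = [[0, 1, 1, 0],
       [1, 0, 0, 0],
       [1, 0, 0, 1],
       [0, 0, 1, 0]]
feats = [[1.0, 0.5], [0.2, 1.0], [0.1, 0.1], [0.0, 0.05]]

# Step 1: drop low-attention nodes. Step 2: aggregate over the pooled graph.
keep, pooled_adj, pooled_feats = attention_pool(adj, feats, [1.0, 1.0], k=2)
identity = [[1.0, 0.0], [0.0, 1.0]]
refined = gcn_layer(pooled_adj, pooled_feats, identity)
```

In this toy run the two low-scoring nodes are pruned and the human and object nodes survive; because the two remaining nodes are mutually connected, the convolution gives both the same averaged feature vector. A real model would learn the attention vector and the weight matrix and would stack several such levels.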

CLC Number: TP391

LI Bao-zhen, ZHANG Jin, WANG Bao-lu, YU Ping. Human-Object Interaction Recognition Integrating Multi-level Visual Features[J]. Computer Science, 2022, 49(11A): 220700012-8. https://doi.org/10.11896/jsjkx.220700012


