Scene Graph Generation Model Combining Multi-scale Feature Map and Ring-type Relationship Reasoning

doi:10.11896/jsjkx.190300002

Abstract

A scene graph is a graph structure that describes the content of an image. Its generation faces two problems: 1) two-step scene graph generation methods lose useful information, which makes the task harder; and 2) the long-tail distribution of visual relationships causes the model to overfit and raises the error rate of relationship reasoning. To address both problems, this paper proposes SGiF (Scene Graph in Features), a scene graph generation model that combines multi-scale feature maps with ring-type relationship reasoning. First, the likelihood that a visual relationship exists is computed for every feature point on the multi-scale feature maps, and the features of high-likelihood points are extracted. Second, subject-object pairs are decoded from the extracted features, and the decoded results are deduplicated according to their category differences, which yields the scene graph structure. Finally, rings (cycles) that contain the target relationship edge are detected in the scene graph structure, the other edges on each ring are used as input to compute an adjustment factor, and this factor adjusts the original relationship reasoning result to complete the scene graph. SGGen and PredCls are used as the evaluation settings. Experiments on a subset of the large-scale scene graph dataset VG (Visual Genome) show that, with multi-scale feature maps, SGiF improves the hit rate of visual relationship detection by 7.1% over the two-step baseline, and that, with ring-type relationship reasoning, SGiF improves the hit rate of relationship reasoning by 2.18% over the baseline without ring-type reasoning, which demonstrates the effectiveness of SGiF.

Key words: Scene graph generation, Multi-scale feature map, Ring-type relationship reasoning, Convolutional neural networks, Image understanding
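The ring-detection step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes rings are limited to three nodes, ignores edge direction when searching for rings, and uses a hypothetical helper name `rings_through_edge` and a toy scene graph. How the adjustment factor is computed from the collected edges is not given in the abstract, so the sketch stops at gathering the other edges of each ring.

```python
from collections import defaultdict


def rings_through_edge(edges, target):
    """Return, for each 3-node ring containing `target`, the other two edges.

    edges  : dict mapping (subject, object) -> predicate label
    target : (subject, object) key of the relation edge to re-examine
    Direction is ignored when detecting rings (an assumption of this sketch).
    """
    # Undirected adjacency built from the directed relation edges.
    neighbours = defaultdict(set)
    for s, o in edges:
        neighbours[s].add(o)
        neighbours[o].add(s)

    def lookup(a, b):
        # Return the stored directed edge between a and b, whichever way it points.
        if (a, b) in edges:
            return (a, edges[(a, b)], b)
        return (b, edges[(b, a)], a)

    s, o = target
    rings = []
    # A third node adjacent to both endpoints closes a triangle with the target edge.
    for k in sorted((neighbours[s] & neighbours[o]) - {s, o}):
        rings.append([lookup(s, k), lookup(k, o)])
    return rings


# Toy scene graph (hypothetical data, for illustration only).
scene_graph = {
    ("man", "horse"): "riding",
    ("man", "hat"): "wearing",
    ("horse", "grass"): "on",
    ("man", "grass"): "on",
}
context = rings_through_edge(scene_graph, ("man", "horse"))
print(context)
```

In this toy graph, the edges returned for ("man", "horse") are the ones that would serve as input to the adjustment factor; an edge such as ("man", "hat") lies on no ring and would keep its original reasoning result.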

CLC Number: TP389.1

ZHUANG Zhi-gang, XU Qing-lin. Scene Graph Generation Model Combining Multi-scale Feature Map and Ring-type Relationship Reasoning[J]. Computer Science, 2020, 47(4): 136-141. https://doi.org/10.11896/jsjkx.190300002

