Computer Science ›› 2020, Vol. 47 ›› Issue (4): 136-141.doi: 10.11896/jsjkx.190300002

• Computer Graphics & Multimedia •

Scene Graph Generation Model Combining Multi-scale Feature Map and Ring-type Relationship Reasoning

ZHUANG Zhi-gang, XU Qing-lin   

  1. School of Computer,Guangdong University of Technology,Guangzhou 510000,China
  • Received: 2019-03-04  Online: 2020-04-15  Published: 2020-04-15
  • Contact: XU Qing-lin, born in 1963, associate professor, master supervisor. His main research interests include cloud computing and software engineering.
  • About author: ZHUANG Zhi-gang, born in 1994, postgraduate. His main research interests include scene graph generation and meta-learning.
  • Supported by:
    This work was supported by the Science and Technology Planning Project of Guangdong Province, China (2016B030306003).

Abstract: A scene graph is a graph that describes image content. Its generation faces two problems: first, two-step scene graph generation methods lose useful information, which makes the task harder; second, the long-tail distribution of visual relationships causes model overfitting, which increases the error rate of relationship reasoning. To solve these two problems, a scene graph generation model SGiF (Scene Graph in Features) based on multi-scale feature maps and ring-type relationship reasoning was proposed. First, the possibility of a visual relationship is calculated for each feature point on the multi-scale feature map, and the features with high possibility are extracted. Then, subject-object combinations are decoded from the extracted features. According to the category differences among the decoding results, duplicates are removed and the scene graph structure is obtained. Finally, rings that contain the targeted relationship edge are detected from the graph structure, and the other edges of each ring are used as input to compute a factor that adjusts the original relationship reasoning result, completing scene graph generation. In this paper, SGGen and PredCls were used as evaluation tasks. Experimental results on the scene-graph-generation subset of the large dataset VG (Visual Genome) show that, by using multi-scale feature maps, SGiF improves the hit rate of visual relationship detection by 7.1% compared with the two-step baseline, and that, by using ring-type relationship reasoning, SGiF improves the accuracy of relationship reasoning by 2.18% compared with a baseline without ring reasoning, thus proving the effectiveness of SGiF.
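The ring-type reasoning step described in the abstract can be sketched in Python. This is a minimal illustrative reconstruction, not the authors' implementation: the function names (`find_rings_through_edge`, `adjust_edge_score`), the undirected treatment of edges during ring detection, and the score-averaging adjustment factor with blending weight are all assumptions made for the example.

```python
def find_rings_through_edge(edges, target, max_len=3):
    """Enumerate simple rings that contain the target edge.

    edges: dict mapping (subject, object) -> relationship score in [0, 1].
    target: the (subject, object) edge whose score we want to adjust.
    Edges are traversed as undirected for ring detection (an assumption).
    Returns, for each ring, the list of its edges excluding the target edge.
    """
    s, o = target
    rings = []

    def dfs(node, path, visited):
        if len(path) >= max_len:            # bound ring size
            return
        for e in edges:
            if e == target or e in path:
                continue
            u, v = e
            nxt = v if u == node else (u if v == node else None)
            if nxt is None:
                continue
            if nxt == s:                     # closed a ring back to the subject
                rings.append(path + [e])
            elif nxt not in visited:
                dfs(nxt, path + [e], visited | {nxt})

    dfs(o, [], {s, o})
    return rings


def adjust_edge_score(edges, target, weight=0.5):
    """Blend the target edge's score with evidence from rings it lies on.

    The factor is the mean over rings of each ring's average edge score;
    this blending scheme is illustrative, not the paper's exact formula.
    """
    rings = find_rings_through_edge(edges, target)
    if not rings:
        return edges[target]
    ring_scores = [sum(edges[e] for e in ring) / len(ring) for ring in rings]
    factor = sum(ring_scores) / len(ring_scores)
    return (1 - weight) * edges[target] + weight * factor


# Toy graph: man-[rides]->horse, horse-[on]->street, man-[on]->street.
# The weak man->street edge sits on a ring whose other edges are strong,
# so its score is pulled upward.
scores = {("man", "horse"): 0.9, ("horse", "street"): 0.8, ("man", "street"): 0.4}
adjusted = adjust_edge_score(scores, ("man", "street"))  # 0.5*0.4 + 0.5*0.85 = 0.625
```

In this sketch, the ring through `horse` supplies context: because "man rides horse" and "horse on street" are both confident, the model raises its belief in "man on street", mirroring how SGiF uses the other edges of a ring to adjust the targeted relationship's score.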

Key words: Convolutional neural networks, Image understanding, Multi-scale feature map, Ring-type relationship reasoning, Scene graph generation

CLC Number: TP389.1