Computer Science ›› 2020, Vol. 47 ›› Issue (6): 133-137. doi: 10.11896/jsjkx.190600110

• Computer Graphics & Multimedia •

Scene Graph Generation Model Combining Attention Mechanism and Feature Fusion

HUANG Yong-tao, YAN Hua

  1. School of Electronics and Information Engineering,Sichuan University,Chengdu 610065,China
  • Received:2019-06-20 Online:2020-06-15 Published:2020-06-10
  • Corresponding author: YAN Hua (yanhua@scu.edu.cn)
  • About author:HUANG Yong-tao,born in 1995,postgraduate,is not a member of China Computer Federation (15198172896@163.com).His main research interests include computer vision,deep learning and parallel computing.
    YAN Hua,born in 1971,Ph.D,professor.His main research interests include intelligent algorithms,storage systems and path planning.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61403265).

Abstract: Understanding a visual scene means not only recognizing individual objects in isolation, but also capturing the interactions between them. A scene graph encodes all (subject-predicate-object) tuples that describe the object relationships inside an image and is widely used in scene understanding tasks. However, most existing scene graph generation models have complicated structures, slow inference and low accuracy, which prevents their direct use in practice. Therefore, a scene graph generation model combining an attention mechanism with feature fusion is proposed on the basis of Factorizable Net. First, the whole image is decomposed into several subgraphs, each containing multiple objects and the relationships among them. Then, position and shape information is fused into the object features, and an attention mechanism is used to pass messages between the object features and the subgraph features. Finally, object classification and relationship inference are carried out from the object features and the subgraph features, respectively. Experimental results show that, on several visual relationship detection datasets, the model achieves an accuracy of 22.78% to 25.41% for visual relationship detection and 16.39% to 22.75% for scene graph generation, which is 1.2% and 1.8% higher than Factorizable Net, respectively; moreover, with a single GTX 1080Ti graphics card, the objects in an image and the relationships between them can be detected within 0.6 s. These results show that the subgraph structure significantly reduces the number of image regions that require relationship inference, while the feature fusion method and the attention-based message passing improve the representational power of the deep features, so that objects and their relationships can be predicted more quickly and accurately. The model thus effectively addresses the poor timeliness and low accuracy of traditional scene graph generation models.

Key words: Attention mechanism, Feature fusion, Message passing, Scene graph, Visual relationship detection
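The following is a minimal PyTorch-style sketch of the two ideas described in the abstract: fusing an object's position and shape information into its appearance feature, and attention-weighted message passing between object features and subgraph features. All names (encode_boxes, GeometryFusion, ObjectSubgraphAttention), the feature dimensions, and the 8-dimensional box encoding are illustrative assumptions, not details taken from the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


def encode_boxes(boxes, img_w, img_h):
    # boxes: (N, 4) tensor of (x1, y1, x2, y2); returns (N, 8) position/shape descriptors
    x1, y1, x2, y2 = boxes.unbind(dim=-1)
    w = (x2 - x1).clamp(min=1e-6)
    h = (y2 - y1).clamp(min=1e-6)
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    return torch.stack([cx / img_w, cy / img_h, w / img_w, h / img_h,
                        x1 / img_w, y1 / img_h, w / h,
                        (w * h) / (img_w * img_h)], dim=-1)


class GeometryFusion(nn.Module):
    # Fuse an object's appearance feature with its position/shape encoding.
    def __init__(self, feat_dim=512, geo_dim=8):
        super().__init__()
        self.geo_proj = nn.Sequential(nn.Linear(geo_dim, feat_dim), nn.ReLU(inplace=True))
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, obj_feats, boxes, img_w, img_h):
        geo = self.geo_proj(encode_boxes(boxes, img_w, img_h))
        return F.relu(self.fuse(torch.cat([obj_feats, geo], dim=-1)))


class ObjectSubgraphAttention(nn.Module):
    # One round of attention-weighted message passing between objects and subgraphs.
    def __init__(self, feat_dim=512):
        super().__init__()
        self.obj_query = nn.Linear(feat_dim, feat_dim)
        self.sub_key = nn.Linear(feat_dim, feat_dim)
        self.msg = nn.Linear(feat_dim, feat_dim)

    def forward(self, obj_feats, sub_feats, link):
        # link: (N_obj, N_sub) binary mask, 1 where object i belongs to subgraph j;
        # every object is assumed to belong to at least one subgraph.
        scores = self.obj_query(obj_feats) @ self.sub_key(sub_feats).t()
        scores = scores / obj_feats.size(-1) ** 0.5
        scores = scores.masked_fill(link == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)                         # objects attend to their subgraphs
        obj_feats = obj_feats + attn @ self.msg(sub_feats)       # subgraph -> object messages
        sub_feats = sub_feats + attn.t() @ self.msg(obj_feats)   # object -> subgraph messages
        return obj_feats, sub_feats


# Toy usage: 5 object proposals grouped into 2 subgraphs.
obj = torch.randn(5, 512)
sub = torch.randn(2, 512)
boxes = torch.tensor([[10., 20., 110., 220.]] * 5)
link = torch.tensor([[1., 0.], [1., 0.], [1., 1.], [0., 1.], [0., 1.]])
obj = GeometryFusion()(obj, boxes, img_w=640.0, img_h=480.0)
obj, sub = ObjectSubgraphAttention()(obj, sub, link)

In this sketch each object attends only to the subgraphs it belongs to (the link mask), mirroring the abstract's point that restricting message passing to subgraphs reduces the number of regions involved in relationship inference.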

CLC Number: TP391.4
[1]JOHNSON J,KRISHNA R,STARK M,et al.Image retrieval using scene graphs[C]//IEEE Conference on Computer Vision and Pattern Recognition.IEEE Computer Society,2015:3668-3678.
[2]CHANG A X,SAVVA M,MANNING C D.Learning spatial knowledge for text to 3d scene generation[C]//Conference on Empirical Methods in Natural Language Processing.2014:2028-2038.
[3]DAI B,ZHANG Y Q,LIN D H.Detecting visual relationships with deep relational networks[C]//IEEE Conference on Computer Vision and Pattern Recognition.IEEE Computer Society,2017:3298-3308.
[4]XU D F,ZHU Y K,FEI-FEI L,et al.Scene Graph Generation by Iterative Message Passing[C]//IEEE Conference on Computer Vision and Pattern Recognition.IEEE Computer Society,2017:3097-3106.
[5]LI Y K,OUYANG W L,WANG X G,et al.Scene graph generation from objects,phrases and region captions[C]//IEEE International Conference on Computer Vision.IEEE Computer Society,2017:1270-1279.
[6]LI Y K,OUYANG W L,WANG X G,et al.Factorizable Net:An Efficient Subgraph-based Framework for Scene Graph Generation[C]//European Conference on Computer Vision.2018:346-363.
[7]LU C,KRISHNA R,BERNSTEIN M,et al.Visual relationship detection with language priors[C]//European Conference on Computer Vision.2016:852-869.
[8]KRISHNA R,ZHU Y K,GROTH O,et al.Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations[J].International Journal of Computer Vision,2017,123(1):32-73.
[9]REN S Q,HE K M,GIRSHICK R,et al.Faster R-CNN:Towards real-time object detection with region proposal networks[C]//Advances in Neural Information Processing Systems (NIPS).2015:91-99.
[10]GIRSHICK R.Fast R-CNN[C]//IEEE International Conference on Computer Vision (ICCV).IEEE,2015:1440-1448.
[11]HE K M,GIRSHICK R,GKIOXARI G,et al.Mask R-CNN[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2018,PP(99):1.
[12]NAIR V,HINTON G E.Rectified linear units improve Restricted Boltzmann machines[C]//27th International Conference on Machine Learning.2010:807-814.
[13]XU K,BA J,KIROS R,et al.Show,attend and tell:Neural image caption generation with visual attention[C]//32nd International Conference on Machine Learning.2015:2048-2057.