Computer Science (计算机科学), 2021, Vol. 48, Issue (8): 145-149. doi: 10.11896/jsjkx.200800207

• Computer Graphics & Multimedia •

Multi-Shared Attention with Global and Local Pathways for Video Question Answering

WANG Lei-quan1, HOU Wen-yan2, YUAN Shao-zu1, ZHAO Xin2, LIN Yao2, WU Chun-lei1

  1 College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, Shandong 266555, China;
  2 College of Oceanography and Space Informatics, China University of Petroleum (East China), Qingdao, Shandong 266555, China
  • Received: 2020-08-29 Revised: 2020-09-30 Published: 2021-08-10
  • Corresponding author: WANG Lei-quan (richiewlq@gmail.com)
  • About author: WANG Lei-quan, born in 1981, Ph.D, senior experimenter, is a member of China Computer Federation. His main research interests include cross media analysis and action recognition.
  • Supported by:
    National Key Research and Development Program (2018YFC1406204) and Fundamental Research Funds for the Central Universities (19CX05003A-11).

Abstract: Video question answering (VideoQA) is an important and challenging task in visual understanding. However, current visual question answering (VQA) methods mainly focus on a single static image, whereas real-world visual data is sequential, dynamic video. In addition, because textual questions are diverse, a VideoQA model has to handle multiple kinds of visual features appropriately for the question at hand in order to produce high-quality answers. This paper presents a multi-shared attention network that uses local and global frame-level visual information for VideoQA. Specifically, video frames are sampled at different frame rates, and frame-level global and local visual features are extracted from them in a two-pathway model; each pathway contains multiple frame-level features, which are used to model the temporal dynamics of the video. The two pathways are then fused by multi-shared attention, in which both share the same attention function to model the correlation between global and local visual features, and the result is combined with the textual question to infer the answer. Extensive experiments on the Tianchi VideoQA dataset validate the effectiveness of the proposed method.

Key words: Video question answering, Shared attention mechanism, Global and local pathways
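
The two-pathway idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the attention is a plain question-guided dot-product attention, fusion is simple concatenation, the sampling strides (4 and 2) are arbitrary, and random vectors stand in for the real frame-level global and local features. The only point it tries to show is that both pathways pass through one attention function with the same parameters `W`.

```python
import math
import random


def sample_indices(num_frames, stride):
    """Pick every `stride`-th frame index, i.e. sample at a lower frame rate."""
    return list(range(0, num_frames, stride))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def shared_attention(features, question, W):
    """Question-guided attention over frame features.

    The same projection W is passed in for every pathway, which is what
    makes the attention "shared" in this sketch (an assumed form).
    Returns the attended feature vector and the attention weights.
    """
    scores = [
        sum(q * p for q, p in zip(question, matvec(W, f))) for f in features
    ]
    weights = softmax(scores)
    dim = len(features[0])
    attended = [
        sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)
    ]
    return attended, weights


random.seed(0)
DIM = 4          # hypothetical feature size
NUM_FRAMES = 16  # hypothetical video length in frames


def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]


# Random stand-ins for per-frame global and local visual features.
global_feats = [rand_vec() for _ in range(NUM_FRAMES)]
local_feats = [rand_vec() for _ in range(NUM_FRAMES)]
question = rand_vec()                    # stand-in for the encoded question
W = [rand_vec() for _ in range(DIM)]     # one set of attention parameters

# Two pathways sampled at different (assumed) frame rates.
global_idx = sample_indices(NUM_FRAMES, 4)
local_idx = sample_indices(NUM_FRAMES, 2)

g_vec, g_w = shared_attention([global_feats[i] for i in global_idx], question, W)
l_vec, l_w = shared_attention([local_feats[i] for i in local_idx], question, W)

# Fuse the two attended vectors by concatenation (an assumed fusion step).
fused = g_vec + l_vec
```

In a full model, `fused` would be combined with the question representation and passed to an answer classifier; that step is omitted here because the abstract does not describe it.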

CLC Number:

  • TP391