Computer Science (计算机科学), 2021, Vol. 48, Issue (8): 145-149. doi: 10.11896/jsjkx.200800207
王雷全1, 候文艳2, 袁韶祖1, 赵欣2, 林瑶2, 吴春雷1
WANG Lei-quan1, HOU Wen-yan2, YUAN Shao-zu1, ZHAO Xin2, LIN Yao2, WU Chun-lei1
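Abstract: Video question answering is an important and challenging task in visual understanding. Current visual question answering (VQA) methods focus mainly on question answering over a single static image, whereas real-world data often takes the form of dynamic video. Moreover, because questions are complex, a video question answering model must handle multiple kinds of visual features appropriately, conditioned on the question, to produce high-quality answers. This paper proposes a multi-shared attention network for video question answering that exploits both local and global frame-level visual information. Specifically, video frames are sampled at different frame rates, and frame-level global and local visual features are extracted from them. Both kinds of features comprise multiple frame-level features and are used to model the temporal dynamics of the video. Shared attention then models the correlation between the global and local visual features, and the result is combined with the text question to infer the answer. Extensive experiments on the Tianchi video question answering dataset verify the effectiveness of the proposed method.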
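The pipeline outlined in the abstract (sample frames, extract global and local frame-level features, attend over both with a shared mechanism, fuse with the question) can be illustrated with a small NumPy sketch. This is not the authors' network: the function names, the feature dimension, the random stand-in features, and the particular attention form (one question-modulated affinity matrix that yields both attention distributions) are all assumptions chosen for illustration.

```python
import numpy as np

def sample_frame_indices(num_frames, stride):
    """Indices kept when taking every `stride`-th frame (hypothetical helper)."""
    return np.arange(0, num_frames, stride)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def shared_attention(global_feats, local_feats, question):
    """Illustrative shared attention, assumed form (not from the paper).

    global_feats: (Tg, d) one whole-frame feature per sampled frame
    local_feats:  (Tl, d) one pooled region feature per sampled frame
    question:     (d,)    encoded question vector
    """
    # One affinity matrix, modulated by the question, is shared by both streams.
    affinity = (global_feats * question) @ local_feats.T   # (Tg, Tl)
    g_weights = softmax(affinity.max(axis=1))              # attention over global frames
    l_weights = softmax(affinity.max(axis=0))              # attention over local frames
    g_summary = g_weights @ global_feats                   # (d,)
    l_summary = l_weights @ local_feats                    # (d,)
    return g_summary, l_summary, g_weights, l_weights

# Toy usage with random stand-ins for real CNN / detector features.
rng = np.random.default_rng(0)
d = 8
dense_idx = sample_frame_indices(32, 2)    # higher sampling rate: 16 frames
sparse_idx = sample_frame_indices(32, 8)   # lower sampling rate: 4 frames
global_feats = rng.normal(size=(len(dense_idx), d))
local_feats = rng.normal(size=(len(sparse_idx), d))
question = rng.normal(size=d)

g, l, gw, lw = shared_attention(global_feats, local_feats, question)
# Fused vector that a (hypothetical) answer classifier would consume.
fused = np.concatenate([g, l, question])
```

Which stream is sampled at which rate is also an assumption here; the abstract only states that frames are extracted at different frame rates. Sharing a single affinity matrix is one simple way to make the two attention distributions depend on each other.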