Computer Science (计算机科学), 2023, Vol. 50, Issue 2: 123-129. doi: 10.11896/jsjkx.211200303
邹芸竹1, 杜圣东1,2, 滕飞1, 李天瑞1,2
ZOU Yunzhu1, DU Shengdong1,2, TENG Fei1, LI Tianrui1,2
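Abstract: In the era of big data, the explosive growth of multi-source heterogeneous data has drawn wide research attention to multimodal data fusion. Visual question answering (VQA), which requires joint processing of images and text, has become a hot topic in this area. The VQA task performs feature association and fused representation over two modalities, image and text, and then reasons over the result to produce an answer. Traditional VQA models tend to lose key modal information during feature fusion, and most methods stay at shallow feature-association representation learning between the data, with little consideration of deep semantic feature fusion. To address these problems, this paper proposes a VQA model based on cross-modal deep interaction of image and text features. The model uses a convolutional neural network and a long short-term memory network to extract image and text features respectively. It then uses a novel deep attention learning network, built by combining meta-attention units, to learn attention features interactively within each modality and between the two modalities. Finally, it forms a multimodal fused representation of the learned features and outputs a prediction through inference. Experiments on the VQA-v2.0 dataset show that the proposed model clearly outperforms the baseline models.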
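The pipeline in the abstract (per-modality features, attention within and between modalities, then fusion) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not define the meta-attention units, so the sketch assumes they resemble plain scaled dot-product attention without learned projections. The names `attention`, `interaction_layer`, and `fuse`, the depth of 2, and the toy feature vectors are all hypothetical; in the described model the features would come from a CNN and an LSTM.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of vectors (no learned weights)."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def add(x, y):
    """Element-wise residual sum of two equally shaped feature lists."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]


def interaction_layer(image_feats, text_feats):
    """One assumed interaction step: intra-modal, then inter-modal attention."""
    # Intra-modal: each modality attends to itself.
    text = add(text_feats, attention(text_feats, text_feats, text_feats))
    image = add(image_feats, attention(image_feats, image_feats, image_feats))
    # Inter-modal: image regions query the question words.
    image = add(image, attention(image, text, text))
    return image, text


def mean_pool(feats):
    """Average a list of vectors into a single vector."""
    return [sum(col) / len(feats) for col in zip(*feats)]


def fuse(image_feats, text_feats, depth=2):
    """Stack interaction layers, then fuse the pooled modalities by summation."""
    for _ in range(depth):
        image_feats, text_feats = interaction_layer(image_feats, text_feats)
    pooled_image = mean_pool(image_feats)
    pooled_text = mean_pool(text_feats)
    return [a + b for a, b in zip(pooled_image, pooled_text)]


# Toy stand-ins: 3 image regions and 2 question words, each a 4-d vector.
image_feats = [[1.0, 0.0, 0.5, 0.2], [0.3, 0.8, 0.1, 0.0], [0.0, 0.2, 0.9, 0.4]]
text_feats = [[0.5, 0.5, 0.0, 0.1], [0.1, 0.0, 0.7, 0.3]]
fused = fuse(image_feats, text_feats)
```

In a trained model, `fused` would feed a classifier over candidate answers. The sketch stops at the fused vector because the abstract gives no details of the prediction head.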