Computer Science ›› 2023, Vol. 50 ›› Issue (2): 123-129. doi: 10.11896/jsjkx.211200303

• Database & Big Data & Data Science •

Visual Question Answering Model Based on Multi-modal Deep Feature Fusion

ZOU Yunzhu1, DU Shengdong1,2, TENG Fei1, LI Tianrui1,2

  1. Institute of Computer and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, China
  2. National Engineering Laboratory of Integrated Transportation Big Data Application Technology, Chengdu 611756, China
  • Received: 2021-12-28  Revised: 2022-06-26  Online: 2023-02-15  Published: 2023-02-22
  • Corresponding author: DU Shengdong (sddu@swjtu.edu.cn)
  • About the author: zyz590@my.swjtu.edu.cn
  • Supported by: National Science and Technology Major Project of the Ministry of Science and Technology of China (2020AAA0105101)

Abstract: In the era of big data, with the explosive growth of multi-source heterogeneous data, multi-modal data fusion has attracted much attention from researchers, and visual question answering (VQA) has become a hot topic in multi-modal fusion research because it requires the joint processing of images and text. The VQA task mainly associates and fuses the features of the image and text modalities and then performs inference over the fused representation to produce an answer. Traditional VQA models tend to lose key modal information during feature fusion, and most methods only learn shallow feature associations between the data, rarely considering deep semantic feature fusion. To address these problems, this paper proposes a VQA model based on deep cross-modal interaction of image and text features. The model uses a convolutional neural network and a long short-term memory (LSTM) network to extract the features of the image and text modalities, respectively, and then builds a novel deep attention learning network from combinations of meta-attention units, realizing interactive attention learning both within and between the two modalities. Finally, the learned features are fused into a multi-modal representation for inference and answer prediction. The model is trained and tested on the VQA-v2.0 dataset, and the experimental results show that its performance is significantly better than that of the baseline models.

Key words: Visual question answering, Multi-modal feature fusion, Attention mechanism, Deep learning, Data fusion
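
The abstract outlines the model pipeline: convolutional image features and LSTM question features are passed through a deep attention network built from meta-attention units, which attend within each modality and across modalities, and the attended features are fused for answer prediction. The page carries no code, so the listing below is only a minimal PyTorch-style sketch of such a pipeline under stated assumptions: the names MetaAttentionUnit and VQAFusionModel, the 2048-dimensional region features, the number of stacked units, and the 3129-way answer classifier are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MetaAttentionUnit(nn.Module):
    """Illustrative attention unit (name assumed): self-attention over one modality,
    optionally guided by cross-attention to the other modality."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, x, guide=None):
        # attention within the modality
        x = self.norm1(x + self.self_attn(x, x, x, need_weights=False)[0])
        # attention between modalities, guided by the other modality when provided
        if guide is not None:
            x = self.norm2(x + self.cross_attn(x, guide, guide, need_weights=False)[0])
        return self.norm3(x + self.ffn(x))


class VQAFusionModel(nn.Module):
    """Toy VQA pipeline: LSTM question encoder, projected CNN region features,
    stacked attention units, and concatenation-based fusion for answer prediction."""

    def __init__(self, vocab_size: int, dim: int = 512, num_units: int = 4, num_answers: int = 3129):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, dim, batch_first=True)       # question (text) features
        self.img_proj = nn.Linear(2048, dim)                   # pre-extracted CNN region features
        self.txt_units = nn.ModuleList([MetaAttentionUnit(dim) for _ in range(num_units)])
        self.img_units = nn.ModuleList([MetaAttentionUnit(dim) for _ in range(num_units)])
        self.classifier = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_answers))

    def forward(self, image_feats, question_ids):
        q, _ = self.lstm(self.embed(question_ids))             # (B, L, dim) question sequence
        v = self.img_proj(image_feats)                         # (B, R, dim) image regions
        for txt_unit, img_unit in zip(self.txt_units, self.img_units):
            q = txt_unit(q)                                    # intra-modal text attention
            v = img_unit(v, guide=q)                           # image attention guided by the text
        fused = torch.cat([q.mean(dim=1), v.mean(dim=1)], dim=-1)   # simple multi-modal fusion
        return self.classifier(fused)                          # answer logits


if __name__ == "__main__":
    model = VQAFusionModel(vocab_size=10000)
    img = torch.randn(2, 36, 2048)                             # 36 region features per image (assumed)
    que = torch.randint(0, 10000, (2, 14))                     # tokenised questions of length 14
    print(model(img, que).shape)                               # torch.Size([2, 3129])
```

Concatenation followed by a linear classifier is used here only as the simplest possible fusion operator; the deeper fusion and inference strategy described in the abstract would replace that final step.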

CLC Number: TP391.41