Computer Science ›› 2020, Vol. 47 ›› Issue (11): 226-230. doi: 10.11896/jsjkx.191200015

• Artificial Intelligence •

Study on Question Processing Algorithms in Visual Question Answering

XU Sheng, ZHU Yong-xin

  1. Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 201210, China
    University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2019-12-02 Revised: 2020-03-20 Online: 2020-11-15 Published: 2020-11-05
  • Corresponding author: ZHU Yong-xin (zhuyongxin@sari.ac.cn)
  • About author: XU Sheng (xusheng@sari.ac.cn), born in 1993, postgraduate. His main research interests include deep learning and natural language processing.
    ZHU Yong-xin, born in 1969, Ph.D., researcher, is a member of China Computer Federation. His main research interests include computer system architecture, system-level chip design, big data, and artificial intelligence.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (U1831118), the Strategic Leading Science and Technology Project of the Chinese Academy of Sciences (XDA19000000, XDA19090106), and the Shanghai Committee of Science and Technology Scientific Research Project (18511103502).

Abstract: Many approaches to modeling Visual Question Answering (VQA) have been proposed, but existing VQA models share a common drawback: training and inference are time-consuming. The text-processing part of most VQA models is based on Long Short-Term Memory (LSTM) networks, and the overall performance of the VQA model is limited by this LSTM network. Because LSTM networks are recurrent, their complex data flow cannot make effective use of GPU parallel computing to accelerate computation. To address this problem and speed up training, this paper proposes a new model, SCMP (Simple Conv1d MaxPool1d), which replaces the LSTM network in processing the natural language questions fed to the model. Experimental results on the VQA2.0 dataset show that the model trains 10 times faster than existing models with no loss in VQA accuracy. In addition, this paper proposes a novel method for augmenting the question text in the VQA2.0 dataset. Experimental results show that data augmentation improves the accuracy of the VQA model and accelerates its convergence. The SCMP model trained with the augmented data obtains an evaluation score of 63.46% on the validation set, which is better than existing VQA models.

Key words: CNN, LSTM, Natural language processing, Visual question answering, Word embedding
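
The abstract names the building blocks of SCMP (a 1D convolution followed by 1D max pooling) but gives no architectural details. The following is only a minimal pure-Python sketch of that general idea, not the authors' implementation: the kernel size, the ReLU activation, the number of filters, and the function names `conv1d`, `max_pool_over_time`, and `encode_question` are all assumptions made for illustration. It shows why such an encoder parallelizes well: each window is computed independently of every other window, whereas an LSTM step needs the hidden state of the previous step.

```python
from typing import List

Vector = List[float]


def conv1d(seq: List[Vector], kernels: List[List[Vector]], bias: Vector) -> List[Vector]:
    """Valid 1D convolution over the token axis, followed by ReLU.

    seq:     T word vectors of size D (e.g. word embeddings of a question)
    kernels: C filters, each a window of K weight vectors of size D
    returns: T - K + 1 feature vectors of size C
    """
    k = len(kernels[0])
    out = []
    # Every window position t is independent of the others, so on a GPU
    # all positions could be computed at the same time.
    for t in range(len(seq) - k + 1):
        window = seq[t:t + k]
        feats = []
        for c, kern in enumerate(kernels):
            s = bias[c]
            for w_vec, x_vec in zip(kern, window):
                s += sum(w * x for w, x in zip(w_vec, x_vec))
            feats.append(max(0.0, s))  # ReLU (an assumption, not from the paper)
        out.append(feats)
    return out


def max_pool_over_time(features: List[Vector]) -> Vector:
    """Collapse the token axis by keeping the maximum of each channel."""
    return [max(channel) for channel in zip(*features)]


def encode_question(seq: List[Vector], kernels: List[List[Vector]], bias: Vector) -> Vector:
    """Map a variable-length question to a fixed-size vector of C features."""
    return max_pool_over_time(conv1d(seq, kernels, bias))


# Toy input: 4 tokens with 2-dimensional embeddings (hypothetical values).
question = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
# Two filters of width K = 2 (hypothetical weights).
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
bias = [0.0, 0.5]

encoded = encode_question(question, kernels, bias)
print(encoded)
```

The output length equals the number of filters regardless of how many tokens the question contains, so the pooled vector can be passed to the rest of a VQA model in place of a final LSTM hidden state.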

CLC Number: TP391