Computer Science, 2020, Vol. 47, Issue (11): 226-230. doi: 10.11896/jsjkx.191200015
XU Sheng (徐胜), ZHU Yong-xin (祝永新)
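Abstract: Research on modeling visual question answering (VQA) is diverse, but existing VQA models share a common drawback: training and inference are time-consuming. Studies show that the text-processing part of a VQA model is mostly built on long short-term memory (LSTM) networks, and that the overall performance of the model is constrained by this LSTM component. Because an LSTM is recurrent, its complex data flow makes it difficult to exploit the parallel computing capability of GPUs to accelerate computation. To address this problem, and with the goal of improving training speed, this paper proposes a new model, SCMP (Simple Conv1d MaxPool1d), which replaces the LSTM in processing the natural-language text fed to the model. Experimental results on the VQA2.0 dataset show that, compared with existing models, the proposed model trains 10 times faster without any loss in VQA accuracy. In addition, the paper proposes a novel method for augmenting the text data in the VQA2.0 dataset. Experimental results show that data augmentation improves the accuracy of the VQA model and accelerates convergence: the model (SCMP) trained on the augmented data achieves an evaluation score of 63.46% on the validation set, outperforming existing VQA models.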
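The idea of replacing a recurrent text encoder with a 1D convolution followed by max pooling can be illustrated with a minimal sketch. The abstract does not give SCMP's layer sizes, kernel width, activation, or pooling span, so everything below (the ReLU, the global pooling over all positions, and the names `conv1d`, `global_max_pool`, and `encode_question`) is an assumption for illustration, not the authors' implementation. The point it shows is that every convolution window is computed independently of the others, whereas each LSTM step must wait for the previous hidden state.

```python
def conv1d(seq, kernels, bias):
    """Valid 1D convolution over the time axis, followed by ReLU (assumed).

    seq:     list of T embedding vectors, each of length D
    kernels: list of C filters, each a list of K weight vectors of length D
    bias:    list of C bias values
    Returns a list of T-K+1 feature vectors, each of length C.
    """
    width = len(kernels[0])
    out = []
    # Each window depends only on the input, never on another window's
    # result, so on a GPU all windows could be evaluated in parallel.
    for t in range(len(seq) - width + 1):
        row = []
        for c, filt in enumerate(kernels):
            s = bias[c]
            for k in range(width):
                s += sum(w * x for w, x in zip(filt[k], seq[t + k]))
            row.append(max(0.0, s))
        out.append(row)
    return out


def global_max_pool(features):
    """Take the maximum of each channel across all time positions."""
    return [max(channel) for channel in zip(*features)]


def encode_question(token_ids, embedding, kernels, bias):
    """Map a tokenized question to a fixed-length vector (hypothetical helper)."""
    seq = [embedding[i] for i in token_ids]
    return global_max_pool(conv1d(seq, kernels, bias))


# Toy example: 4-word vocabulary, 2-dimensional embeddings, 2 filters of width 2.
embedding = [[0, 0], [1, 0], [0, 1], [1, 1]]
kernels = [
    [[1, 0], [0, 1]],
    [[-1, -1], [-1, -1]],
]
bias = [0.0, 0.5]

short_vec = encode_question([1, 2, 3], embedding, kernels, bias)
padded_vec = encode_question([1, 2, 3, 0, 0], embedding, kernels, bias)
print(short_vec, padded_vec)
```

The output vector has one entry per filter regardless of question length, which is what lets such an encoder feed a fixed-size fusion layer in place of an LSTM's final hidden state.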