Computer Science, 2020, Vol. 47, Issue (11): 226-230. doi: 10.11896/jsjkx.191200015

• Artificial Intelligence •

Study on Question Processing Algorithms in Visual Question Answering

XU Sheng, ZHU Yong-xin   

  1. Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 201210, China
    University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2019-12-02 Revised: 2020-03-20 Online: 2020-11-15 Published: 2020-11-05
  • About author: XU Sheng, born in 1993, postgraduate. His main research interests include deep learning and natural language processing.
    ZHU Yong-xin, born in 1969, Ph.D, researcher, is a member of China Computer Federation. His main research interests include computer system architecture, system-level chip design, big data, and artificial intelligence.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (U1831118), Strategic Leading Science and Technology Project of the Chinese Academy of Sciences (XDA19000000, XDA19090106) and Shanghai Committee Science and Technology Scientific Research Project (18511103502).

Abstract: Many models have been proposed for Visual Question Answering (VQA) tasks, but existing VQA models share a common drawback: training and inference are time-consuming. Research shows that the text-processing part of a VQA model is mainly based on LSTM (Long Short-Term Memory) networks, and that the overall performance of the model is limited by the LSTM network used for text processing. Because the LSTM network is recurrent, its complex data flow can hardly exploit GPU parallel computing for acceleration. To address this problem and speed up training, this paper proposes a new model named SCMP (Simple Conv1d MaxPool1d) that replaces the LSTM network in processing incoming natural language questions. Experimental results on the VQA2.0 dataset show that the model trains 10 times faster than the existing model with no loss of VQA accuracy. In addition, this paper proposes a novel data augmentation method for the questions in the VQA2.0 dataset. Experimental results show that data augmentation improves prediction performance and accelerates model convergence. The SCMP model trained with the augmented data obtains an evaluation score of 63.46% on the validation set, which is better than the existing VQA model.
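
The abstract names the two building blocks of SCMP (a 1D convolution and a 1D max pooling) but gives no layer sizes, so the following is only a minimal sketch of that general idea, not the authors' implementation. It assumes a question is already a list of word-embedding vectors, and the function names, kernel width, ReLU activation and toy weights are all hypothetical choices made for illustration. The point it shows is that each convolution window is computed independently of the others, whereas each step of a recurrent network must wait for the previous one.

```python
# Illustrative sketch only: 1D convolution over time followed by global
# max pooling. All sizes, weights and the ReLU are assumptions, not
# values taken from the paper.

def conv1d(seq, kernels, bias):
    """Valid 1D convolution over the time axis.

    seq:     list of T embedding vectors, each of length D
    kernels: list of C filters, each a list of K vectors of length D
    bias:    list of C floats
    Returns a list of T-K+1 feature vectors, each of length C.
    """
    k = len(kernels[0])
    out = []
    for t in range(len(seq) - k + 1):  # windows do not depend on each other
        window = seq[t:t + k]
        feats = []
        for filt, b in zip(kernels, bias):
            s = b + sum(w * x
                        for w_vec, x_vec in zip(filt, window)
                        for w, x in zip(w_vec, x_vec))
            feats.append(max(0.0, s))  # ReLU (assumed activation)
        out.append(feats)
    return out


def global_max_pool(features):
    """Max over the time axis: one value per channel."""
    return [max(channel) for channel in zip(*features)]


def encode_question(embedded_question, kernels, bias):
    """Fixed-size question vector from a variable-length embedded question."""
    return global_max_pool(conv1d(embedded_question, kernels, bias))


# Toy input: T=4 words, D=2 embedding dimensions, C=2 filters of width K=2.
question = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
bias = [0.0, 0.5]
vec = encode_question(question, kernels, bias)
print(vec)  # [2.0, 0.0]
```

In a deep learning framework the same two steps would normally be expressed with built-in convolution and pooling layers, which evaluate all windows in parallel on a GPU.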

Key words: CNN, LSTM, Natural language processing, Visual question answering, Word embedding

CLC Number: 

  • TP391