Computer Science ›› 2024, Vol. 51 ›› Issue (5): 85-91. doi: 10.11896/jsjkx.230300202

• Computer Graphics & Multimedia •

Cross-modal Information Filtering-based Networks for Visual Question Answering

HE Shiyang1, WANG Zhaohui2, GONG Shengrong1,3, ZHONG Shan3   

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215008, China
    2. Soochow College, Soochow University, Suzhou, Jiangsu 215006, China
    3. School of Computer Science and Engineering, Changshu Institute of Technology, Suzhou, Jiangsu 215500, China
  • Received: 2023-03-26  Revised: 2023-08-09  Online: 2024-05-15  Published: 2024-05-08
  • Corresponding author: GONG Shengrong (shrgong@suda.edu.cn)
  • About author: HE Shiyang (bujiayana@163.com), born in 1995, postgraduate. His main research interests include machine learning and computer vision.
    GONG Shengrong, born in 1966, Ph.D, professor, Ph.D supervisor. His main research interests include image and video processing, pattern recognition and computer vision.
  • Supported by:
    National Natural Science Foundation of China (61972059, 42071438), Natural Science Foundation of Jiangsu Province, China (BK20191474, BK20191475) and Key Laboratory of Symbolic Computing and Knowledge Engineering, Ministry of Education, Jilin University (93K172021K01).


Abstract: As a multi-modal task, the bottleneck of visual question answering (VQA) lies in fusing information from different modalities, which requires not only a full understanding of the visual content of the image and the text of the question, but also the ability to align cross-modal representations. The attention mechanism provides an effective path for multi-modal fusion. However, previous methods usually apply attention directly to the extracted image features, ignoring the noise and incorrect information these features contain, and most of them are limited to shallow interaction between modalities without considering deeper inter-modal semantics. To address this, a cross-modal information filtering network (CIFN) is proposed. First, with the question features serving as a supervision signal, a dedicated information filtering module filters the image features so that they better fit the question representation. The image features and question features are then fed into a cross-modal interaction layer, where intra-modal and inter-modal relationships are modeled by self-attention and guided attention respectively, yielding finer-grained multi-modal features. Extensive experiments on the VQA2.0 dataset show that the information filtering module effectively improves model accuracy; the overall accuracy on test-std reaches 71.51%, which is competitive with most state-of-the-art methods.
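No implementation is given on this page; the following PyTorch-style sketch is only one plausible way to wire a question-guided information filter in front of a guided-attention layer as described in the abstract. The module names (InformationFilter, GuidedAttention), the sigmoid gating formulation, and the 512-dimensional feature sizes are assumptions for illustration, not the authors' released code.

import torch
import torch.nn as nn

class InformationFilter(nn.Module):
    # Question-guided gate: suppresses image-region features that are weakly
    # related to the question (illustrative formulation, not the paper's exact one).
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, img_feats, q_vec):
        # img_feats: (batch, num_regions, dim); q_vec: (batch, dim)
        q_exp = q_vec.unsqueeze(1).expand_as(img_feats)
        g = torch.sigmoid(self.gate(torch.cat([img_feats, q_exp], dim=-1)))
        return g * img_feats

class GuidedAttention(nn.Module):
    # Inter-modal relation: filtered image regions attend to question words.
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feats, q_feats):
        attended, _ = self.attn(img_feats, q_feats, q_feats)
        return self.norm(img_feats + attended)

if __name__ == "__main__":
    # Toy shapes: 36 detected regions, 14 question tokens, 512-d features.
    img = torch.randn(2, 36, 512)
    q_words = torch.randn(2, 14, 512)
    q_vec = q_words.mean(dim=1)   # stand-in for an LSTM question summary
    filtered = InformationFilter()(img, q_vec)
    fused = GuidedAttention()(filtered, q_words)
    print(fused.shape)            # torch.Size([2, 36, 512])

In such a design, the self-attention blocks of the cross-modal interaction layer (not shown) would be standard multi-head self-attention applied within each modality, and the filtered, attended features would then be fused and passed to an answer classifier.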

Key words: Visual question answering, Deep learning, Attention mechanism, Multi-modal fusion, Information filtering

CLC number: TP391