Computer Science ›› 2024, Vol. 51 ›› Issue (5): 85-91. doi: 10.11896/jsjkx.230300202
HE Shiyang1, WANG Zhaohui2, GONG Shengrong1,3, ZHONG Shan3
Abstract: As a multimodal task, visual question answering (VQA) is bottlenecked by the fusion of information across modalities: the model must not only fully understand the visual content of the image and the text of the question, but also align their cross-modal representations. Attention mechanisms offer an effective path to multimodal fusion, yet prior methods usually feed the extracted image features directly into the attention computation, ignoring the fact that image features contain noisy and incorrect information; moreover, most methods are limited to shallow inter-modal interaction and do not consider deeper inter-modal semantics. To address these problems, this paper proposes a cross-modal information filtering network. Using the question feature as a supervision signal, a purpose-built information filtering module first filters the image features so that they better match the question representation; the filtered image features and the question features are then fed into a cross-modal interaction layer, where self-attention and guided attention model intra-modal and inter-modal relations respectively, yielding finer-grained multimodal features. Extensive experiments on the VQA2.0 dataset show that the information filtering module effectively improves model accuracy, reaching an overall accuracy of 71.51% on test-std, which compares favorably with most state-of-the-art methods.
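The filtering-then-attention pipeline described above can be illustrated with a minimal NumPy sketch. This is an assumed simplification, not the paper's actual network: the gating form (a sigmoid gate computed from each region feature concatenated with the question vector) and the single-head question-guided attention are illustrative stand-ins, and all weight names (`W_g`, `b_g`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def filter_image_features(V, q, W_g, b_g):
    """Information filtering (illustrative): use the question vector q as a
    supervision signal to gate each image-region feature in V (n x d)."""
    n, _ = V.shape
    # pair every region feature with the (broadcast) question vector
    vq = np.concatenate([V, np.tile(q, (n, 1))], axis=1)  # (n, 2d)
    gate = sigmoid(vq @ W_g + b_g)                        # (n, d), values in (0, 1)
    return gate * V                                       # suppress question-irrelevant content

def guided_attention(V, q):
    """Single-head question-guided attention over image regions."""
    d = V.shape[1]
    attn = softmax(V @ q / np.sqrt(d))  # (n,) attention weights
    return attn @ V, attn               # fused feature, weights

rng = np.random.default_rng(0)
n, d = 5, 8                          # 5 regions, 8-dim toy features
V = rng.normal(size=(n, d))          # region features (e.g. from a detector)
q = rng.normal(size=d)               # question feature (e.g. from an RNN)
W_g = rng.normal(size=(2 * d, d)) * 0.1
b_g = np.zeros(d)

V_filtered = filter_image_features(V, q, W_g, b_g)
fused, attn = guided_attention(V_filtered, q)
```

In the full model, the filtered image features and the question features would pass through stacked self-attention (intra-modal) and guided-attention (inter-modal) units rather than the single guided-attention step shown here.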