Computer Science ›› 2024, Vol. 51 ›› Issue (5): 85-91. doi: 10.11896/jsjkx.230300202

• Computer Graphics & Multimedia •

Cross-modal Information Filtering-based Networks for Visual Question Answering

HE Shiyang1, WANG Zhaohui2, GONG Shengrong1,3, ZHONG Shan3   

  1 School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215008, China
    2 Soochow College, Soochow University, Suzhou, Jiangsu 215006, China
    3 School of Computer Science and Engineering, Changshu Institute of Technology, Suzhou, Jiangsu 215500, China
  • Received: 2023-03-26  Revised: 2023-08-09  Online: 2024-05-15  Published: 2024-05-08
  • About author: HE Shiyang, born in 1995, postgraduate. His main research interests include machine learning and computer vision.
    GONG Shengrong, born in 1966, Ph.D, professor, Ph.D supervisor. His main research interests include image and video processing, pattern recognition and computer vision.
  • Supported by:
    National Natural Science Foundation of China (61972059, 42071438), Natural Science Foundation of Jiangsu Province, China (BK20191474, BK20191475) and Key Laboratory of Symbolic Computing and Knowledge Engineering, Ministry of Education, Jilin University (93K172021K01).

Abstract: As a multi-modal task, the bottleneck of visual question answering (VQA) lies in fusing information from different modalities: it requires not only a full understanding of both the visual content of the image and the text of the question, but also the ability to align cross-modal representations. The attention mechanism provides an effective path for multi-modal fusion. However, previous methods usually operate directly on the extracted image features, ignoring the noise and irrelevant information they contain, and most methods are limited to shallow interaction between modalities without modeling their deeper semantic relations. To address these problems, a cross-modal information filtering network (CIFN) is proposed. First, taking the question feature as a supervision signal, an information filtering module is designed to filter the image feature information so that it better fits the question representation. The image features and question features are then fed into a cross-modal interaction layer, where intra-modal and inter-modal relationships are modeled by self-attention and guided attention respectively, yielding finer-grained multi-modal features. Extensive experiments on the VQA 2.0 dataset show that the information filtering module effectively improves model accuracy, and the overall accuracy on the test-std split reaches 71.51%, which is competitive with state-of-the-art methods.
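To make the two components described in the abstract concrete, the sketch below shows, in PyTorch, (a) a question-supervised gate that filters image-region features and (b) one cross-modal interaction layer combining intra-modal self-attention with question-guided attention over image regions. This is a minimal illustrative sketch reconstructed from the abstract only: all class names, dimensions, and the sigmoid-gating formulation are assumptions, not the authors' released CIFN implementation.

```python
import torch
import torch.nn as nn


class InformationFilter(nn.Module):
    """Hypothetical gate: use the question vector to filter image-region features."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, regions: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # regions: [batch, num_regions, dim]; question: [batch, dim]
        q = question.unsqueeze(1).expand(-1, regions.size(1), -1)
        g = torch.sigmoid(self.gate(torch.cat([regions, q], dim=-1)))  # per-region gate in (0, 1)
        return g * regions  # suppress region features that do not fit the question


class CrossModalLayer(nn.Module):
    """One interaction layer: self-attention within each modality, then question-guided attention."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.text_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.guided = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, words: torch.Tensor, regions: torch.Tensor):
        # intra-modal relations
        words, _ = self.text_self(words, words, words)
        regions, _ = self.image_self(regions, regions, regions)
        # inter-modal relations: image regions attend to the question words
        regions, _ = self.guided(regions, words, words)
        return words, regions


if __name__ == "__main__":
    words = torch.randn(2, 14, 512)    # question word features (e.g. from an LSTM)
    regions = torch.randn(2, 36, 512)  # image-region features (e.g. from a detector)
    q_vec = words.mean(dim=1)          # a simple pooled question vector

    filtered = InformationFilter()(regions, q_vec)
    words_out, regions_out = CrossModalLayer()(words, filtered)
    print(words_out.shape, regions_out.shape)  # [2, 14, 512] [2, 36, 512]
```

In this reading of the abstract, filtering happens once before the interaction layers, so later self- and guided attention operate only on regions that have already been reweighted toward the question; the actual paper may stack the interaction layer several times and use a different gating form.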

Key words: Visual question answering, Deep learning, Attention mechanism, Multi-modal fusion, Information filtering

CLC Number: TP391