Computer Science, 2022, Vol. 49, Issue (12): 229-235. doi: 10.11896/jsjkx.220600038

• Computer Graphics & Multimedia •

Visual Question Answering Method Based on Counterfactual Thinking

YUAN De-sen, LIU Xiu-jing, WU Qing-bo, LI Hong-liang, MENG Fan-man, NGAN King-ngi, XU Lin-feng   

  1. School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611730, China
  • Received: 2022-06-06 Revised: 2022-08-16 Published: 2022-12-14
  • About author: YUAN De-sen, born in 1997, postgraduate. His main research interests include multi-modal learning and deep learning. WU Qing-bo, born in 1985, Ph.D, associate professor, master supervisor, is a member of China Computer Federation. His main research interests include image and video coding, image and video quality assessment and visual perception models.
  • Supported by:
    National Natural Science Foundation of China (61831005, 61971095).

Abstract: Visual question answering (VQA) is a challenging multi-modal task that combines computer vision and natural language processing. However, current VQA models are often misled by superficial correlations in the data, so that their outputs are driven directly by language bias. Many previous studies address language bias by assisting the model with counterfactual samples. These studies, however, ignore the prediction information and the difference between key and non-key features in counterfactual samples. In view of this, this paper proposes a contrastive learning paradigm based on counterfactual samples, which enables the model to distinguish among the original sample, the factual sample and the counterfactual sample. By comparing these three samples in terms of feature gaps and prediction gaps, the robustness of the VQA model is significantly improved. Compared with the CL-VQA method, the overall precision, average precision and Num index of the proposed method improve by 0.19%, 0.89% and 2.6%, respectively. Compared with the CSSVQA method, the Gap of the proposed method decreases from 0.96 to 0.45.
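
The abstract does not state the loss functions, so the following is only a minimal sketch of how a feature gap and a prediction gap among the three samples might be computed. It assumes an InfoNCE-style contrastive term over cosine similarities for the feature gap and a Jensen-Shannon divergence for the prediction gap; these choices, the temperature value, and all function names are illustrative assumptions, not the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero feature vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def feature_gap_loss(original, factual, counterfactual, temperature=0.5):
    """Hypothetical feature-gap term (InfoNCE-style, one positive, one negative).

    Small when the original features are close to the factual features
    and far from the counterfactual features.
    """
    pos = math.exp(cosine(original, factual) / temperature)
    neg = math.exp(cosine(original, counterfactual) / temperature)
    return -math.log(pos / (pos + neg))


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def prediction_gap_loss(p_original, p_factual, p_counterfactual):
    """Hypothetical prediction-gap term over answer distributions.

    Pulls the factual prediction toward the original prediction and
    pushes the counterfactual prediction away from it.
    """
    return js_divergence(p_original, p_factual) - js_divergence(
        p_original, p_counterfactual
    )


if __name__ == "__main__":
    # Toy 2-D features and 2-answer distributions, purely illustrative.
    feat_loss = feature_gap_loss([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])
    pred_loss = prediction_gap_loss([0.8, 0.2], [0.7, 0.3], [0.2, 0.8])
    print("feature gap loss:", feat_loss)
    print("prediction gap loss:", pred_loss)
```

In a sketch like this the two terms would typically be added, with some weighting, to the ordinary VQA answer loss; how the paper actually combines them is not specified in the abstract.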

Key words: Visual question answering, Causal inference, Counterfactual thinking, Contrastive learning, Deep learning

CLC Number: 

  • TP391