Computer Science, 2022, Vol. 49, Issue 12: 229-235. doi: 10.11896/jsjkx.220600038

• Computer Graphics & Multimedia •


Visual Question Answering Method Based on Counterfactual Thinking

YUAN De-sen, LIU Xiu-jing, WU Qing-bo, LI Hong-liang, MENG Fan-man, NGAN King-ngi, XU Lin-feng   

  1. School of Information and Communication Engineering,University of Electronic Science and Technology of China,Chengdu 611730,China
  • Received: 2022-06-06 Revised: 2022-08-16 Published: 2022-12-14
  • Corresponding author: WU Qing-bo (qbwu@uestc.edu.cn)
  • About author: YUAN De-sen, born in 1997, postgraduate (desenyuan97@163.com). His main research interests include multi-modal learning and deep learning. WU Qing-bo, born in 1985, Ph.D, associate professor, master supervisor, is a member of China Computer Federation. His main research interests include image and video coding, image and video quality assessment and visual perception models.
  • Supported by:
    National Natural Science Foundation of China(61831005,61971095).


Abstract: Visual question answering (VQA) is a multi-modal task that combines computer vision and natural language processing, and it is extremely challenging. However, current VQA models are often misled by apparent correlations in the data, so their outputs are directly driven by language bias, which harms robustness. Much previous research focuses on alleviating language bias by synthesizing counterfactual samples to assist the model. These studies, however, ignore the prediction gap between counterfactual samples and the original samples, as well as the pairwise differences between key and non-key features. By establishing a counterfactual thinking pipeline that combines causal inference with contrastive learning, the proposed model can distinguish the original sample, the factual sample and the counterfactual sample. On this basis, this paper proposes a contrastive learning paradigm based on counterfactual samples. By contrasting the feature gaps and the prediction gaps among the three types of sample pairs, the language bias of the VQA model is reduced. Experiments on VQA-CP v2 and related datasets demonstrate the effectiveness of the proposed method. Compared with the CL-VQA method, the overall accuracy, average accuracy and Num accuracy of the proposed method improve by 0.19%, 0.89% and 2.6% respectively; compared with the CSSVQA method, the robustness auxiliary metric Gap decreases from 0.96 to 0.45.

Key words: Visual question answering, Causal inference, Counterfactual thinking, Contrastive learning, Deep learning
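The abstract describes the mechanism only at a high level: an original sample, a factual sample and a counterfactual sample are contrasted in both feature space and prediction space. The minimal sketch below illustrates what such an objective could look like. It is a reconstruction under stated assumptions, drawing on the contrastive-learning and Jensen-Shannon-divergence references [35-37], not the authors' released implementation; every tensor name (z_orig, z_fact, z_cf, out_*) and every weight (tau, alpha, beta, margin) is hypothetical.

```python
# Minimal sketch (not the paper's code): contrast an original VQA sample
# against a factual (positive) and a counterfactual (negative) view in
# feature space, and measure prediction gaps with Jensen-Shannon divergence.
import torch
import torch.nn.functional as F

def feature_contrast(z_orig, z_fact, z_cf, tau=0.1):
    """InfoNCE-style term: pull the factual feature toward the original,
    push the counterfactual feature away (one positive, one negative)."""
    pos = F.cosine_similarity(z_orig, z_fact, dim=-1) / tau   # shape (B,)
    neg = F.cosine_similarity(z_orig, z_cf, dim=-1) / tau     # shape (B,)
    logits = torch.stack([pos, neg], dim=-1)                  # shape (B, 2)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)                    # positive is index 0

def js_divergence(p_logits, q_logits):
    """Symmetric Jensen-Shannon divergence between two answer distributions."""
    p = F.softmax(p_logits, dim=-1)
    q = F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    # F.kl_div(input, target) expects log-probabilities as `input`
    # and computes KL(target || exp(input)).
    return 0.5 * (F.kl_div(m.log(), p, reduction="batchmean")
                  + F.kl_div(m.log(), q, reduction="batchmean"))

def total_loss(out_orig, out_fact, out_cf, z_orig, z_fact, z_cf,
               answers, alpha=1.0, beta=1.0):
    """Hypothetical overall objective: standard soft-answer VQA loss plus
    feature-gap and prediction-gap terms over the three sample types."""
    vqa = F.binary_cross_entropy_with_logits(out_orig, answers)  # soft scores
    feat = feature_contrast(z_orig, z_fact, z_cf)
    # Keep factual predictions close to the original; a hinge keeps the
    # counterfactual term from driving the loss unboundedly negative.
    margin = 0.5  # hypothetical margin; JS divergence never exceeds ln 2
    pred = (js_divergence(out_orig, out_fact)
            + F.relu(margin - js_divergence(out_orig, out_cf)))
    return vqa + alpha * feat + beta * pred
```

In a full system, the z_* inputs would be fused multi-modal features from a backbone such as the bottom-up/top-down model [25] and the out_* inputs the corresponding answer logits for the original, factual and counterfactual inputs; the hinge on the counterfactual prediction gap is one simple way to keep that term bounded.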

  • CLC Number: TP391

References
[1]NIU Y L,ZHANG H W.Survey on Visual Question Answering and Dialogue [J].Computer Science,2021,48(3):87-96.
[2]FU P C,YANG G,LIU X M,et al.Visual Question Answering Network Method Based on Spatial Relationship and Frequency[J].Computer Engineering,2022,48(9):96-104.
[3]ZOU P R,XIAO F,ZHANG W J,et al.Multi-Module Co-Attention Model for Visual Question Answering[J].Computer Engineering,2022,48(2):250-260.
[4]WU A M,JIANG P,HAN Y H.Survey of Cross-media Question Answering and Reasoning Based on Vision and Language [J].Computer Science,2021,48(3):71-78.
[5]XU S,ZHU Y X.Study on Question Processing Algorithms in Visual Question Answering [J].Computer Science,2020,47(11):226-230.
[6]WANG S H,YAN X,HUANG Q M.Overview of Research on Cross-media Analysis and Reasoning Technology [J].Computer Science,2021,48(3):79-86.
[7]YUAN D.Language bias in Visual Question Answering:A Survey and Taxonomy [J].arXiv:2111.08531,2021.
[8]ANTOL S,AGRAWAL A,LU J,et al.VQA:Visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision.Santiago,Chile:IEEE,2015:2425-2433.
[9]ANTOL S,AGRAWAL A,LU J,et al.VQA:Visual question answering[J].International Journal of Computer Vision,2017,123(1):4-31.
[10]AGRAWAL A,BATRA D,PARIKH D.Analyzing the behavior of visual question answering models[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing.Austin,USA:ACL,2016:1955-1960.
[11]ZHANG P,GOYAL Y,SUMMERS-STAY D,et al.Yin and yang:Balancing and answering binary visual questions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Las Vegas,USA:IEEE,2016:5014-5022.
[12]JOHNSON J,HARIHARAN B,MAATEN L,et al.CLEVR:A diagnostic dataset for compositional language and elementary visual reasoning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Hawaii,USA:IEEE,2017:2901-2910.
[13]GOYAL Y,KHOT T,SUMMERS-STAY D,et al.Making the V in VQA matter:Elevating the role of image understanding in visual question answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Honolulu,USA:IEEE,2017:6904-6913.
[14]CHEN L,YAN X,XIAO J,et al.Counterfactual samples synthesizing for robust visual question answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Seattle,USA:IEEE,2020:10800-10809.
[15]LIANG Z,JIANG W,HU H,et al.Learning to contrast the counterfactual samples for robust visual question answering[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing.Stroudsburg,USA:ACL,2020:3285-3292.
[16]AGRAWAL A,BATRA D,PARIKH D,et al.Don’t just assume;look and answer:Overcoming priors for visual question answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Salt Lake City,USA:IEEE,2018:4971-4980.
[17]SELVARAJU R R,LEE S,SHEN Y,et al.Taking a hint:Leveraging explanations to make vision and language models more grounded[C]//Proceedings of the IEEE International Conference on Computer Vision.Seoul,Korea:IEEE,2019:2591-2600.
[18]WU J L,MOONEY R J.Self-critical reasoning for robust visual question answering[J].arXiv:1905.09998,2019.
[19]RAMAKRISHNAN S,AGRAWAL A,LEE S.Overcoming language priors in visual question answering with adversarial regularization[J].arXiv:1810.03649,2018.
[20]CADENE R,DANCETTE C,BEN-YOUNES H,et al.RUBi:Reducing unimodal biases in visual question answering[C]//Advances in Neural Information Processing Systems.2019.
[21]CLARK C,YATSKAR M,ZETTLEMOYER L.Don’t take the easy way out:Ensemble based methods for avoiding known dataset biases[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing.Hong Kong,China:ACL,2019:4069-4082.
[22]ABBASNEJAD E,TENEY D,PARVANEH A,et al.Counterfactual vision and language learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Seattle,USA:IEEE,2020:10044-10054.
[23]GOODFELLOW I,POUGET-ABADIE J,MIRZA M,et al.Ge-nerative adversarial nets[J].arXiv:1406.2661v1,2014.
[24]TENEY D,ABBASNEJAD E,VAN DEN HENGEL A.Learning what makes a difference from counterfactual examples and gradient supervision[C]//Proceedings of European Conference on Computer Vision.Glasgow,UK:Springer,2020:580-599.
[25]ANDERSON P,HE X,BUEHLER C,et al.Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Salt Lake City,USA:IEEE,2018:6077-6086.
[26]REN S,HE K,GIRSHICK R,et al.Faster r-cnn:Towards real-time object detection with region proposal networks[C]//NIPS.2016.
[27]GREFF K,SRIVASTAVA R K,KOUTNÍK J,et al.LSTM:A search space odyssey[J].IEEE Transactions on Neural Networks and Learning Systems,2016,28(10):2222-2232.
[28]PEARL J.Causality:models,reasoning and inference[M].Cambridge:Cambridge University Press,2000.
[29]PEARL J.Direct and indirect effects[C]//Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence.Morgan Kaufmann Publishers Inc.,2001:411-420.
[30]PEARL J,MACKENZIE D.The book of why:the new science of cause and effect[M].New York:Basic Books,2018.
[31]HARARI Y.A brief history of humankind[M].Beijing:CITIC Publishing House,2014.
[32]WEISBERG D S,GOPNIK A.Pretense,counterfactuals,and Bayesian causal models:Why what is not real really matters[J].Cognitive Science,2013,37(7):1368-1381.
[33]ROESE N J,EPSTUDE K.The functional theory of counterfactual thinking:New evidence,new challenges,new insights[J].Advances in Experimental Social Psychology,2017,56:1-79.
[34]BRIGARD F D,ADDIS D R,FORD J H,et al.Remembering what could have happened:Neural correlates of episodic counterfactual thinking[J].Neuropsychologia,2013,51(12):2401-2414.
[35]OORD A,LI Y,VINYALS O.Representation learning with contrastive predictive coding[J].arXiv:1807.03748,2019.
[36]FUGLEDE B,TOPSOE F.Jensen-Shannon divergence and Hilbert space embedding[C]//Proceedings of International Symposium on Information Theory.Chicago,USA:IEEE,2004.
[37]LIN J.Divergence measures based on the Shannon entropy[J].IEEE Transactions on Information Theory,1991,37(1):145-151.
[38]GAT I,SCHWARTZ I,SCHWING A,et al.Removing bias in multi-modal classifiers:Regularization by maximizing functional entropies[J].Advances in Neural Information Processing Systems,2020,33:3197-3208.
[39]TENEY D,KAFLE K,SHRESTHA R,et al.On the value of out-of-distribution testing:An example of goodhart's law[J].Advances in Neural Information Processing Systems,2020,33:407-417.
[40]NIU Y,TANG K,ZHANG H,et al.Counterfactual VQA:A cause-effect look at language bias[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Online:IEEE,2021:12700-12710.
[41]HAN X,WANG S,SU C,et al.Greedy gradient ensemble for robust visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision.Online:IEEE,2021:1584-1593.