Computer Science ›› 2021, Vol. 48 ›› Issue (3): 87-96. DOI: 10.11896/jsjkx.201200174
Special Issue: Advances on Multimedia Technology
NIU Yu-lei, ZHANG Han-wang