Computer Science, 2023, Vol. 50, Issue (1): 166-175. doi: 10.11896/jsjkx.211100237
• Artificial Intelligence •
WANG Ruiping1,2, WU Shihong2, ZHANG Meihang3, WANG Xiaoping1
[1] YUAN De-sen, LIU Xiu-jing, WU Qing-bo, LI Hong-liang, MENG Fan-man, NGAN King-ngi, XU Lin-feng. Visual Question Answering Method Based on Counterfactual Thinking [J]. Computer Science, 2022, 49(12): 229-235.
[2] NIU Yu-lei, ZHANG Han-wang. Survey on Visual Question Answering and Dialogue [J]. Computer Science, 2021, 48(3): 87-96.
[3] XU Sheng, ZHU Yong-xin. Study on Question Processing Algorithms in Visual Question Answering [J]. Computer Science, 2020, 47(11): 226-230.