Computer Science, 2021, Vol. 48, Issue (3): 71-78. doi: 10.11896/jsjkx.201100176
Special Issue: Advances on Multimedia Technology
WU A-ming, JIANG Pin, HAN Ya-hong
[1] TURING A M. Computing machinery and intelligence[J]. Mind, 1950, 59(236): 433-460.
[2] TENEY D, ANDERSON P, HE X, et al. Tips and tricks for visual question answering: Learnings from the 2017 challenge[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 4223-4232.
[3] JABRI A, JOULIN A, VAN DER MAATEN L. Revisiting visual question answering baselines[C]//European Conference on Computer Vision. Springer, Cham, 2016: 727-739.
[4] ZHU L, XU Z, YANG Y, et al. Uncovering the temporal context for video question answering[J]. International Journal of Computer Vision, 2017, 124(3): 409-421.
[5] ZELLERS R, BISK Y, FARHADI A, et al. From recognition to cognition: Visual commonsense reasoning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 6720-6731.
[6] WU Q, TENEY D, WANG P, et al. Visual question answering: A survey of methods and datasets[J]. Computer Vision and Image Understanding, 2017, 163: 21-40.
[7] DRUZHKOV P N, KUSTIKOVA V D. A survey of deep learning methods and software tools for image classification and object detection[J]. Pattern Recognition and Image Analysis, 2016, 26(1): 9-15.
[8] YANG S, WANG Y, CHU X. A Survey of Deep Learning Techniques for Neural Machine Translation[J]. arXiv:2002.07526, 2020.
[9] FUKUI A, PARK D H, YANG D, et al. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding[C]//Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016: 457-468.
[10] LU J, YANG J, BATRA D, et al. Hierarchical question-image co-attention for visual question answering[C]//Advances in Neural Information Processing Systems. 2016: 289-297.
[11] YU Z, YU J, CUI Y, et al. Deep modular co-attention networks for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 6281-6290.
[12] NGUYEN B D, DO T T, NGUYEN B X, et al. Overcoming data limitation in medical visual question answering[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2019: 522-530.
[13] DAS A, KOTTUR S, GUPTA K, et al. Visual dialog[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 326-335.
[14] SEO P H, LEHRMANN A, HAN B, et al. Visual reference resolution using attention memory for visual dialog[C]//Advances in Neural Information Processing Systems. 2017: 3719-3729.
[15] ANTOL S, AGRAWAL A, LU J, et al. VQA: Visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision. New York: IEEE Press, 2015: 2425-2433.
[16] REN M, KIROS R, ZEMEL R. Exploring models and data for image question answering[C]//Advances in Neural Information Processing Systems. 2015: 2953-2961.
[17] SHIH K J, SINGH S, HOIEM D. Where to look: Focus regions for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 4613-4621.
[18] KIM J H, LEE S W, KWAK D, et al. Multimodal residual learning for visual QA[C]//Advances in Neural Information Processing Systems. 2016: 361-369.
[19] LI R, JIA J. Visual question answering with question representation update (QRU)[C]//Advances in Neural Information Processing Systems. 2016: 4655-4663.
[20] CHARIKAR M, CHEN K, FARACH-COLTON M. Finding frequent items in data streams[C]//International Colloquium on Automata, Languages, and Programming. Berlin, Heidelberg: Springer, 2002: 693-703.
[21] KIM J H, ON K W, LIM W, et al. Hadamard Product for Low-rank Bilinear Pooling[C]//International Conference on Learning Representations. 2017.
[22] BEN-YOUNES H, CADENE R, CORD M, et al. MUTAN: Multimodal tucker fusion for visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 2612-2620.
[23] YU Z, YU J, FAN J, et al. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 1821-1830.
[24] BEN-YOUNES H, CADENE R, THOME N, et al. Block: Bilinear superdiagonal fusion for visual question answering and visual relationship detection[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 8102-8109.
[25] XU K, BA J, KIROS R, et al. Show, attend and tell: Neural image caption generation with visual attention[C]//International Conference on Machine Learning. 2015: 2048-2057.
[26] YANG Z, HE X, GAO J, et al. Stacked attention networks for image question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 21-29.
[27] LI R, JIA J. Visual question answering with question representation update (QRU)[C]//Advances in Neural Information Processing Systems. 2016: 4655-4663.
[28] ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6077-6086.
[29] REN S, HE K, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Advances in Neural Information Processing Systems. 2015: 91-99.
[30] SCHWARTZ I, SCHWING A, HAZAN T. High-order attention models for visual question answering[C]//Advances in Neural Information Processing Systems. 2017: 3664-3674.
[31] LI Y, KAISER L, BENGIO S, et al. Area attention[C]//International Conference on Machine Learning. PMLR, 2019: 3846-3855.
[32] PATRO B, NAMBOODIRI V P. Differential attention for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7680-7688.
[33] GUO W, ZHANG Y, WU X, et al. Re-Attention for Visual Question Answering[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2020: 91-98.
[34] ANDREAS J, ROHRBACH M, DARRELL T, et al. Neural module networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 39-48.
[35] HU R, ANDREAS J, ROHRBACH M, et al. Learning to reason: End-to-end module networks for visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 804-813.
[36] HUDSON D A, MANNING C D. Compositional Attention Networks for Machine Reasoning[C]//International Conference on Learning Representations. 2018.
[37] GAO P, YOU H, ZHANG Z, et al. Multi-modality latent interaction network for visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 5825-5835.
[38] CADENE R, BEN-YOUNES H, CORD M, et al. MUREL: Multimodal relational reasoning for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 1989-1998.
[39] GAO P, JIANG Z, YOU H, et al. Dynamic fusion with intra- and inter-modality attention flow for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 6639-6648.
[40] KIPF T N, WELLING M. Semi-Supervised Classification with Graph Convolutional Networks[C]//International Conference on Learning Representations. 2017.
[41] VELIČKOVIĆ P, CUCURULL G, CASANOVA A, et al. Graph Attention Networks[C]//International Conference on Learning Representations. 2018.
[42] MONTI F, BOSCAINI D, MASCI J, et al. Geometric deep learning on graphs and manifolds using mixture model CNNs[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 5115-5124.
[43] TENEY D, LIU L, VAN DEN HENGEL A. Graph-structured representations for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 1-9.
[44] NORCLIFFE-BROWN W, VAFEIAS S, PARISOT S. Learning conditioned graph structures for interpretable visual question answering[C]//Advances in Neural Information Processing Systems. 2018: 8334-8343.
[45] HU R, ROHRBACH A, DARRELL T, et al. Language-conditioned graph networks for relational reasoning[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 10294-10303.
[46] KHADEMI M. Multimodal Neural Graph Memory Networks for Visual Question Answering[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 7177-7188.
[47] SUKHBAATAR S, WESTON J, FERGUS R. End-to-end memory networks[C]//Advances in Neural Information Processing Systems. 2015: 2440-2448.
[48] HUDSON D, MANNING C D. Learning by abstraction: The neural state machine[C]//Advances in Neural Information Processing Systems. 2019: 5903-5916.
[49] HAN Y, WANG B, HONG R, et al. Movie question answering via textual memory and plot graph[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 30(3): 875-887.
[50] WANG B, XU Y, HAN Y, et al. Movie question answering: Remembering the textual cues for layered visual contents[J]. arXiv:1804.09412, 2018.
[51] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[52] TAPASWI M, ZHU Y, STIEFELHAGEN R, et al. MovieQA: Understanding stories in movies through question-answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 4631-4640.
[53] GAO J, GE R, CHEN K, et al. Motion-appearance co-memory networks for video question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6576-6585.
[54] KIM J, MA M, KIM K, et al. Progressive attention memory network for movie story question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 8337-8346.
[55] LI X, SONG J, GAO L, et al. Beyond RNNs: Positional self-attention with co-attention for video question answering[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 8658-8665.
[56] KIM J, MA M, PHAM T, et al. Modality Shifting Attention Network for Multi-Modal Video Question Answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 10106-10115.
[57] GAN Z, GAN C, HE X, et al. Semantic compositional networks for visual captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 5630-5639.
[58] YAO T, PAN Y, LI Y, et al. Exploring visual relationship for image captioning[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 684-699.
[59] CHEN L, ZHANG H, XIAO J, et al. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 5659-5667.
[60] JIANG P, HAN Y. Reasoning with Heterogeneous Graph Alignment for Video Question Answering[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2020: 11109-11116.
[61] SONG X, SHI Y, CHEN X, et al. Explore multi-step reasoning in video question answering[C]//Proceedings of the 26th ACM International Conference on Multimedia. 2018: 239-247.
[62] WU A, ZHU L, HAN Y, et al. Connective Cognition Network for Directional Visual Commonsense Reasoning[C]//Advances in Neural Information Processing Systems. 2019: 5669-5679.
[63] YU W, ZHOU J, YU W, et al. Heterogeneous Graph Learning for Visual Commonsense Reasoning[C]//Advances in Neural Information Processing Systems. 2019: 2769-2779.
[64] LIN J, JAIN U, SCHWING A G. TAB-VCR: Tags and Attributes based Visual Commonsense Reasoning Baselines[C]//Advances in Neural Information Processing Systems. 2019.
[65] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019: 4171-4186.
[66] LU J, BATRA D, PARIKH D, et al. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks[C]//Advances in Neural Information Processing Systems. 2019: 13-23.
[67] SU W, ZHU X, CAO Y, et al. VL-BERT: Pre-training of generic visual-linguistic representations[C]//International Conference on Learning Representations. 2020.
[68] GOYAL Y, KHOT T, SUMMERS-STAY D, et al. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6904-6913.
[69] ZHANG Y, HARE J, PRÜGEL-BENNETT A. Learning to count objects in natural images for visual question answering[J]. arXiv:1802.05766, 2018.
[70] YU Z, YU J, XIANG C, et al. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering[J]. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(12): 5947-5959.
[71] KIM J H, JUN J, ZHANG B T. Bilinear attention networks[C]//Advances in Neural Information Processing Systems. 2018: 1564-1574.
[72] JANG Y, SONG Y, YU Y, et al. TGIF-QA: Toward spatio-temporal reasoning in visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2758-2766.
[73] FAN C, ZHANG X, ZHANG S, et al. Heterogeneous memory enhanced multimodal attention model for video question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 1999-2007.
[74] PENNINGTON J, SOCHER R, MANNING C D. GloVe: Global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014: 1532-1543.
[75] RUSSAKOVSKY O, DENG J, SU H, et al. ImageNet large scale visual recognition challenge[J]. International Journal of Computer Vision, 2015, 115(3): 211-252.
[76] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.