Computer Science, 2025, Vol. 52, Issue (1): 315-322. doi: 10.11896/jsjkx.231100107
赵倩, 郭斌, 刘宇博, 孙卓, 王豪, 陈梦琦
ZHAO Qian, GUO Bin, LIU Yubo, SUN Zhuo, WANG Hao, CHEN Mengqi
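Abstract: Video dialogue is an important topic in multimodal human-computer interaction. It involves a large amount of spatio-temporal visual information and complex multimodal relationships, which poses great challenges for research in this area. Existing video dialogue models use cross-modal attention mechanisms or graph structures to capture the correlation between video semantics and dialogue context. However, they process all visual information at a single coarse granularity, so they tend to overlook fine-grained spatio-temporal information, such as the continuous motion of the same object over time or objects located in non-salient regions of an image, and this degrades video dialogue performance. Conversely, processing all visual information at fine granularity increases processing latency and reduces the fluency of the dialogue. Therefore, a semantically rich video dialogue generation method based on hierarchical visual attention is proposed. First, guided by the dialogue context, global visual attention captures global visual semantic information and locates the temporal segment/spatial range of the video that the dialogue input attends to. Second, a local attention mechanism further captures fine-grained visual information and, combined with multi-task learning, the dialogue response is generated. Experimental results on the DSTC7 AVSD dataset show that, compared with existing baseline methods, the dialogues generated by the proposed method are more accurate and more diverse, with the METEOR metric improved by 23.24%.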
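The coarse-to-fine idea in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the abstract does not specify the feature extractors, the attention formulation, or how regions are selected. The sketch assumes scaled dot-product attention, mean-pooled segment features as the coarse representation, and a hypothetical `top_k` parameter for choosing which segments receive fine-grained attention. All function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys):
    """Scaled dot-product attention weights of one query over key vectors."""
    scale = math.sqrt(len(query))
    return softmax([dot(query, k) / scale for k in keys])


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def mean_vector(vectors):
    return weighted_sum([1.0 / len(vectors)] * len(vectors), vectors)


def hierarchical_attention(query, segments, top_k=1):
    """Coarse-to-fine attention (illustrative assumption, not the paper's design).

    query: dialogue-context vector (hypothetical encoding).
    segments: list of video segments, each a list of frame feature vectors.
    Returns (global_context, local_context, selected_segment_indices).
    """
    # Global stage: one coarse vector per segment (assumed: mean of its frames).
    coarse = [mean_vector(frames) for frames in segments]
    global_weights = attend(query, coarse)
    global_context = weighted_sum(global_weights, coarse)

    # Locate the segments that the dialogue query attends to most strongly.
    ranked = sorted(range(len(segments)), key=lambda i: global_weights[i], reverse=True)
    selected = sorted(ranked[:top_k])

    # Local stage: fine-grained attention over frames of the selected segments only.
    frames = [f for i in selected for f in segments[i]]
    local_weights = attend(query, frames)
    local_context = weighted_sum(local_weights, frames)
    return global_context, local_context, selected


# Toy usage: two segments of two 2-dimensional frame features each.
query = [1.0, 0.0]
segments = [
    [[0.0, 1.0], [0.0, 0.9]],
    [[2.0, 0.0], [0.1, 0.0]],
]
global_ctx, local_ctx, chosen = hierarchical_attention(query, segments, top_k=1)
```

Restricting the local stage to the selected segments is what keeps the cost below that of fine-grained attention over every frame, which matches the latency concern raised in the abstract.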