Computer Science ›› 2025, Vol. 52 ›› Issue (1): 315-322. doi: 10.11896/jsjkx.231100107

• Artificial Intelligence •

Generation of Semantically Rich Video Dialogue Based on Hierarchical Visual Attention

ZHAO Qian, GUO Bin, LIU Yubo, SUN Zhuo, WANG Hao, CHEN Mengqi   

  1. School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China
  • Received: 2023-11-19 Revised: 2024-05-06 Online: 2025-01-15 Published: 2025-01-09
  • Corresponding author: GUO Bin (guob@nwpu.edu.cn)
  • About author: ZHAO Qian, born in 2001, postgraduate (qzhao@mail.nwpu.edu.cn), is a member of CCF (No. P2226G). Her main research interest is visual human-computer dialogue.
    GUO Bin, born in 1980, Ph.D, professor, doctoral supervisor. His main research interests include ubiquitous computing, mobile crowd sensing, and big data intelligence.
  • Supported by:
    National Science Fund for Distinguished Young Scholars of China (62025205) and National Natural Science Foundation of China (62032020, 62102322).

Abstract: Video dialogue is an important research direction in multimodal human-computer interaction. Video contains a large amount of spatio-temporal visual information and complex multimodal relationships, which makes designing efficient video dialogue systems challenging. Existing video dialogue models use cross-modal attention mechanisms or graph structures to capture the correlation between video semantics and dialogue context. However, they process all visual information at a single coarse granularity, so they tend to overlook fine-grained spatio-temporal cues, such as the sustained motion of an object over time or objects located in non-salient regions of a frame, which degrades dialogue performance. Conversely, processing all visual information at a fine granularity increases latency and reduces dialogue fluency. We therefore propose a hierarchical visual attention method for generating semantically rich video dialogue. First, guided by the dialogue context, global visual attention captures global visual semantics and localizes the temporal segments and spatial regions of the video that the dialogue input attends to. Second, a local attention mechanism captures fine-grained visual information within the localized regions, and the dialogue response is generated with a multi-task learning method. Experimental results on the DSTC7 AVSD dataset show that, compared with existing baselines, the proposed method generates more accurate and diverse dialogue, improving the METEOR score by 23.24%.
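To make the two-stage attention concrete, the following is a minimal PyTorch sketch of how a global-then-local visual attention module could be organized. It is an illustrative reconstruction from the abstract alone: the module names, tensor shapes, number of heads, and the top-k segment selection are all assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of hierarchical (global -> local) visual attention.
# Shapes, head counts, and the top-k localization step are illustrative only.
import torch
import torch.nn as nn


class HierarchicalVisualAttention(nn.Module):
    def __init__(self, dim: int = 512, k: int = 4):
        super().__init__()
        self.k = k  # number of video segments kept for fine-grained attention
        # Stage 1: coarse attention over segment-level (global) features.
        self.global_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Stage 2: fine attention over region-level (local) features.
        self.local_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, query, seg_feats, region_feats):
        """
        query:        (B, Lq, D)   dialogue-context encoding
        seg_feats:    (B, S, D)    one coarse feature per video segment
        region_feats: (B, S, R, D) fine-grained region features per segment
        """
        B, S, R, D = region_feats.shape
        # Stage 1: attend over segments and score each segment's relevance.
        g_out, g_weights = self.global_attn(query, seg_feats, seg_feats)
        # Average the attention mass each segment receives over query tokens.
        seg_scores = g_weights.mean(dim=1)                       # (B, S)
        # Localize: keep only the top-k most relevant segments.
        topk = seg_scores.topk(self.k, dim=-1).indices           # (B, k)
        idx = topk[..., None, None].expand(-1, -1, R, D)
        local = region_feats.gather(1, idx).reshape(B, self.k * R, D)
        # Stage 2: fine-grained attention restricted to the localized regions.
        l_out, _ = self.local_attn(query, local, local)
        # Fuse coarse and fine views of the video for the response decoder.
        return g_out + l_out


if __name__ == "__main__":
    attn = HierarchicalVisualAttention()
    ctx = torch.randn(2, 10, 512)         # dialogue context: batch 2, 10 tokens
    segs = torch.randn(2, 16, 512)        # 16 video segments
    regions = torch.randn(2, 16, 8, 512)  # 8 region features per segment
    print(attn(ctx, segs, regions).shape)  # torch.Size([2, 10, 512])
```

Restricting the second stage to the top-k localized segments is what would keep the fine-grained pass cheap, matching the abstract's argument that fine-grained processing of all visual information increases latency and hurts fluency.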

Key words: Multi-modal human-computer interaction, Hierarchical attention mechanism, Multi-task learning, Scene perception
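The accuracy claim is reported via METEOR, a standard generation metric. Below is a small, self-contained scoring example using NLTK's implementation; the sentences are invented for illustration, and NLTK's meteor_score is only one of several METEOR implementations.

```python
# Scoring a generated dialogue response against a reference with METEOR.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet for synonym matching

reference = "a man is holding a cup while talking".split()
hypothesis = "a man holds a cup and talks".split()

# meteor_score takes a list of tokenized references and one tokenized hypothesis.
print(f"METEOR: {meteor_score([reference], hypothesis):.4f}")
```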

CLC Number: TP391