Computer Science ›› 2025, Vol. 52 ›› Issue (1): 315-322. doi: 10.11896/jsjkx.231100107

• Artificial Intelligence •

Generation of Semantically Rich Video Dialogue Based on Hierarchical Visual Attention

ZHAO Qian, GUO Bin, LIU Yubo, SUN Zhuo, WANG Hao, CHEN Mengqi   

  1. School of Computer Science, Northwestern Polytechnical University, Xi’an 710129, China
  • Received:2023-11-19 Revised:2024-05-06 Online:2025-01-15 Published:2025-01-09
  • About author: ZHAO Qian, born in 2001, postgraduate, is a member of CCF (No. P2226G). Her main research interest is visual human-computer dialogue.
    GUO Bin, born in 1980, Ph.D., professor, doctoral supervisor. His main research interests include ubiquitous computing, mobile crowd sensing, and big data intelligence.
  • Supported by:
    National Science Foundation for Distinguished Young Scholars of China (62025205) and National Natural Science Foundation of China (62032020, 62102322).

Abstract: Video dialogue has emerged as an important research direction in the field of multimodal human-computer interaction. The large amount of temporal and spatial visual information and the complex relationships among modalities make it challenging to design efficient video dialogue systems. Existing video dialogue systems use cross-modal attention mechanisms or graph structures to capture the correlation between video semantics and dialogue context. However, they process all visual information at a single coarse granularity, which loses fine-grained temporal and spatial information such as the continuous motion of a single object or subtle positional cues within an image. Conversely, processing all visual information at a fine granularity increases latency and degrades dialogue fluency. This paper therefore proposes a hierarchical visual attention-based method for generating semantically rich video dialogue. First, guided by the dialogue context, global visual attention captures global visual semantic information and localizes the temporal/spatial scope of the video associated with the dialogue input. Second, a local attention mechanism captures fine-grained visual information within the localized scope, and the dialogue response is generated with a multi-task learning method. Experimental results on the DSTC7 AVSD dataset show that dialogues generated by the proposed method are more accurate and diverse, with a METEOR score improvement of 23.24%.
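To make the two-stage attention described above concrete, the following is a minimal PyTorch-style sketch of global-then-local visual attention, not the authors' implementation. The tensor shapes, the top-k segment localization step, and the name hierarchical_visual_attention are all illustrative assumptions; the multi-task response-generation head is omitted.

import torch
import torch.nn.functional as F

def hierarchical_visual_attention(query, segment_feats, frame_feats, k=2):
    # query:         (B, D)       dialogue-context embedding
    # segment_feats: (B, S, D)    coarse per-segment video features
    # frame_feats:   (B, S, T, D) fine-grained per-frame features
    # Returns a (B, D) visual context vector for response generation.
    B, _, T, D = frame_feats.shape

    # Stage 1: global attention scores locate the segments relevant
    # to the dialogue input (the temporal scope of the video).
    global_scores = torch.einsum("bd,bsd->bs", query, segment_feats) / D ** 0.5
    topk_idx = global_scores.topk(k, dim=-1).indices              # (B, k)

    # Gather fine-grained frame features only from the localized segments.
    idx = topk_idx[:, :, None, None].expand(-1, -1, T, D)
    local_feats = frame_feats.gather(1, idx).reshape(B, k * T, D)

    # Stage 2: local attention over fine-grained features in that scope.
    local_scores = torch.einsum("bd,bnd->bn", query, local_feats) / D ** 0.5
    weights = F.softmax(local_scores, dim=-1)                     # (B, k*T)
    return torch.einsum("bn,bnd->bd", weights, local_feats)

# Toy usage: batch of 2 dialogues, 8 segments of 16 frames, 256-dim features.
q = torch.randn(2, 256)
ctx = hierarchical_visual_attention(q, torch.randn(2, 8, 256),
                                    torch.randn(2, 8, 16, 256))
print(ctx.shape)  # torch.Size([2, 256])

The point of the hierarchy is that the expensive fine-grained pass runs only over the k segments selected by the coarse pass, which is how the method trades off visual detail against the latency of processing the whole video at fine granularity.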

Key words: Multi-modal human-computer interaction, Hierarchical attention mechanism, Multi-task learning, Scene perception

CLC Number: TP391