Computer Science ›› 2025, Vol. 52 ›› Issue (1): 315-322. doi: 10.11896/jsjkx.231100107

• Artificial Intelligence •

Generation of Semantically Rich Video Dialogue Based on Hierarchical Visual Attention

ZHAO Qian, GUO Bin, LIU Yubo, SUN Zhuo, WANG Hao, CHEN Mengqi   

  1. School of Computer Science, Northwestern Polytechnical University, Xi’an 710129, China
  • Received:2023-11-19 Revised:2024-05-06 Online:2025-01-15 Published:2025-01-09
  • About author: ZHAO Qian, born in 2001, postgraduate, is a member of CCF (No. P2226G). Her main research interest is visual human-computer dialogue.
    GUO Bin, born in 1980, Ph.D., professor, doctoral supervisor. His main research interests include ubiquitous computing, mobile crowd sensing, and big data intelligence.
  • Supported by:
    National Science Foundation for Distinguished Young Scholars of China (62025205) and National Natural Science Foundation of China (62032020, 62102322).

Abstract: Video dialogue has emerged as an important research direction in the field of multimodal human-computer interaction. The large amount of temporal and spatial visual information and the complex relationships among modalities make it challenging to design efficient video dialogue systems. Existing video dialogue systems use cross-modal attention mechanisms or graph structures to capture the correlation between video semantics and dialogue context. However, they process all visual information at a single coarse granularity, which loses fine-grained temporal and spatial information such as the continuous motion of a single object or subtle positional cues within an image. Conversely, processing all visual information at a fine granularity increases latency and degrades dialogue fluency. This paper therefore proposes a hierarchical visual attention-based method for generating semantically rich video dialogue. First, guided by the dialogue context, global visual attention captures global visual semantic information and localizes the temporal/spatial scope of the video associated with the dialogue input. Second, a local attention mechanism captures fine-grained visual information within the localized scope, and the dialogue response is generated with a multi-task learning method. Experimental results on the DSTC7 AVSD dataset show that dialogues generated by the proposed method are more accurate and diverse, with a METEOR score improvement of 23.24%.
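To make the two-stage attention described above concrete, the following is a minimal PyTorch-style sketch of global-then-local visual attention, not the authors' implementation. The tensor shapes, the top-k segment localization step, and the name hierarchical_visual_attention are all illustrative assumptions; the multi-task response-generation head is omitted.

import torch
import torch.nn.functional as F

def hierarchical_visual_attention(query, segment_feats, frame_feats, k=2):
    # query:         (B, D)       dialogue-context embedding
    # segment_feats: (B, S, D)    coarse per-segment video features
    # frame_feats:   (B, S, T, D) fine-grained per-frame features
    # Returns a (B, D) visual context vector for response generation.
    B, _, T, D = frame_feats.shape

    # Stage 1: global attention scores locate the segments relevant
    # to the dialogue input (the temporal scope of the video).
    global_scores = torch.einsum("bd,bsd->bs", query, segment_feats) / D ** 0.5
    topk_idx = global_scores.topk(k, dim=-1).indices              # (B, k)

    # Gather fine-grained frame features only from the localized segments.
    idx = topk_idx[:, :, None, None].expand(-1, -1, T, D)
    local_feats = frame_feats.gather(1, idx).reshape(B, k * T, D)

    # Stage 2: local attention over fine-grained features in that scope.
    local_scores = torch.einsum("bd,bnd->bn", query, local_feats) / D ** 0.5
    weights = F.softmax(local_scores, dim=-1)                     # (B, k*T)
    return torch.einsum("bn,bnd->bd", weights, local_feats)

# Toy usage: batch of 2 dialogues, 8 segments of 16 frames, 256-dim features.
q = torch.randn(2, 256)
ctx = hierarchical_visual_attention(q, torch.randn(2, 8, 256),
                                    torch.randn(2, 8, 16, 256))
print(ctx.shape)  # torch.Size([2, 256])

The point of the hierarchy is that the expensive fine-grained pass runs only over the k segments selected by the coarse pass, which is how the method trades off visual detail against the latency of processing the whole video at fine granularity.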

Key words: Multi-modal human-computer interaction, Hierarchical attention mechanism, Multi-task learning, Scene perception

CLC Number: TP391