Computer Science, 2025, Vol. 52, Issue (9): 269-275. doi: 10.11896/jsjkx.240700136
王元龙, 张宁倩, 张虎
WANG Yuanlong, ZHANG Ningqian, ZHANG Hu
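Abstract: In recent years, visual story generation has attracted growing attention from researchers in computer vision and natural language processing. Most existing models focus on enhancing image representations, for example by introducing external knowledge or scene graphs. Although these approaches have made progress, the generated stories still suffer from repeated content and sparse detail. To address these problems, this paper proposes a visual story generation model based on plan learning. The model sets questions along six dimensions (topic, object, action, location, reasoning, and prediction), uses a pre-trained visual question answering language model to generate the answers, and uses the resulting plan to guide story generation. The model has four stages: the first extracts visual information from the images; the second extracts and selects relevant concepts with a concept generator; the third uses a pre-trained language model to guide the generation of planning information; the fourth fuses the visual, concept, and planning information produced by the first three stages to generate the visual story. Evaluated on the public VIST dataset, the proposed model improves on the existing COVS model by 1.58, 2.7, 0.4, 2.2, 3.6, and 5.6 percentage points on BLEU-1, BLEU-2, ROUGE_L, Distinct-3, Distinct-4, and TTR, respectively.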
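The abstract reports Distinct-3, Distinct-4, and TTR as diversity measures. As a rough illustration of what these scores capture, the sketch below computes Distinct-n (unique n-grams divided by total n-grams) and a type-token ratio (unique tokens divided by total tokens) over a set of generated stories. It assumes that TTR stands for type-token ratio, that tokens are lowercased and split on whitespace, and that scores are pooled over the whole corpus; the paper's actual tokenization and aggregation are not stated in the abstract, and the function names here are hypothetical.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(stories: List[str], n: int) -> float:
    """Corpus-level Distinct-n: unique n-grams / total n-grams.

    Assumption: whitespace tokenization on lowercased text.
    """
    pooled = []
    for story in stories:
        pooled.extend(ngrams(story.lower().split(), n))
    if not pooled:
        return 0.0
    return len(set(pooled)) / len(pooled)


def type_token_ratio(stories: List[str]) -> float:
    """Corpus-level type-token ratio: unique tokens / total tokens."""
    tokens = [tok for story in stories for tok in story.lower().split()]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    generated = ["the dog ran and the dog ran"]
    print(distinct_n(generated, 3))
    print(distinct_n(generated, 4))
    print(type_token_ratio(generated))
```

Higher values indicate less repetition, which is why a model that targets repeated content would be expected to raise these scores.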