Computer Science ›› 2025, Vol. 52 ›› Issue (9): 269-275. doi: 10.11896/jsjkx.240700136

• Computer Graphics & Multimedia •


Visual Storytelling Based on Planning Learning

WANG Yuanlong, ZHANG Ningqian, ZHANG Hu   

  1. School of Computer and Information Technology,Shanxi University,Taiyuan 030006,China
  • Received:2024-07-22 Revised:2024-11-01 Online:2025-09-15 Published:2025-09-11
• Corresponding author:WANG Yuanlong(ylwang007@163.com)
  • About author:WANG Yuanlong,born in 1982,Ph.D.,associate professor,is a member of CCF(No.48432M).His main research interests include natural language processing and graphics and image processing.
  • Supported by:
    National Natural Science Foundation of China(62176145).


Abstract: Visual storytelling has attracted growing interest from researchers in computer vision and natural language processing. Most existing models concentrate on enhancing image representation, for example by introducing external knowledge or scene graphs. Although this has brought some progress, the generated stories still suffer from content repetition and a lack of descriptive detail. To address these issues, this paper proposes a visual storytelling model based on planning learning. The model poses questions along six dimensions (theme, object, action, place, reasoning, and prediction) and uses a pre-trained visual question answering language model to generate answers, producing a plan that guides story generation. The model works in four stages. The first stage extracts visual information from the images. The second stage extracts and selects relevant concepts with a concept generator. The third stage uses a pre-trained language model to guide the generation of planning information. The fourth stage fuses the visual, conceptual, and planning information produced by the first three stages to generate the story. The model's effectiveness is validated on the public VIST dataset: compared with the existing COVS model, it improves BLEU-1, BLEU-2, ROUGE_L, Distinct-3, Distinct-4, and TTR by 1.58, 2.7, 0.4, 2.2, 3.6, and 5.6 percentage points, respectively.
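
As an illustration of the four-stage pipeline the abstract describes, the following Python sketch wires stub components together in that order. Every class, method, and question template here is a hypothetical placeholder (the abstract specifies only that stage 1 extracts visual information, stage 2 selects concepts, stage 3 queries a pre-trained VQA model across the six planning dimensions, and stage 4 fuses the three signals); this is a sketch of the control flow, not the authors' implementation.

# Hedged sketch of the four-stage planning-guided pipeline from the abstract.
# All names below are illustrative assumptions, not the paper's code.
PLANNING_DIMENSIONS = ["theme", "object", "action", "place", "reasoning", "prediction"]

class StubModels:
    """Hypothetical stand-ins for the paper's learned components."""
    def visual_features(self, image):            # stage 1: visual extraction (e.g. Faster R-CNN [13])
        return {"regions": [f"region_of_{image}"]}
    def concepts(self, features):                # stage 2: concept generator
        return ["beach", "family"]
    def vqa_answer(self, image, question):       # stage 3: pre-trained VQA model
        return f"answer to '{question}'"
    def decode(self, features, concepts, plan):  # stage 4: fusion and generation
        return f"A story about {', '.join(concepts)}, theme: {plan['theme']}."

def generate_story(images, m):
    """Run the four stages for each image and join the sentences into a story."""
    sentences = []
    for image in images:
        feats = m.visual_features(image)                           # stage 1
        concepts = m.concepts(feats)                               # stage 2
        plan = {d: m.vqa_answer(image, f"What is the {d} of this image?")
                for d in PLANNING_DIMENSIONS}                      # stage 3
        sentences.append(m.decode(feats, concepts, plan))          # stage 4
    return " ".join(sentences)

print(generate_story(["img1.jpg", "img2.jpg"], StubModels()))

In the paper the plan answers condition the decoder jointly with the visual and concept features; the stub decode method above merely gestures at that fusion.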

Key words: Visual storytelling, Planning learning, Visual question answering
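
The reported Distinct-3, Distinct-4, and TTR gains measure lexical diversity of the generated stories. The minimal Python sketch below shows how these two metric families are conventionally computed (unique n-gram ratio and type-token ratio); it is an illustrative reference, not the paper's evaluation code, and the whitespace tokenizer is an assumption.

# Minimal reference implementations of two diversity metrics named in the abstract.
def distinct_n(texts, n):
    """Fraction of n-grams across all texts that are unique (Distinct-n)."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def type_token_ratio(texts):
    """Unique words divided by total words over all texts (TTR)."""
    tokens = [tok for text in texts for tok in text.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

stories = [
    "the family gathered at the beach for the day",
    "the kids built a sandcastle near the water",
]
print(f"Distinct-3: {distinct_n(stories, 3):.3f}")
print(f"Distinct-4: {distinct_n(stories, 4):.3f}")
print(f"TTR: {type_token_ratio(stories):.3f}")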

CLC Number: TP391

[1]WANG R,WEI Z,LI P,et al.Storytelling from an Image Stream Using Scene Graphs[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2020:9185-9192.
[2]HSU C C,CHEN Z Y,HSU C Y,et al.Knowledge-Enriched Visual Storytelling[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2020:7952-7960.
[3]LI M M,JIANG A W,LONG Y Z,et al.Visual story generation algorithm based on fine-grained visual features and knowledge graph[J].Journal of Chinese Information Processing,2022,36(9):139-148.
[4]GU J,WANG H,FAN R.Coherent Visual Storytelling via Parallel Top-Down Visual and Topic Attention[J].IEEE Transactions on Circuits and Systems for Video Technology,2022,33(1):257-268.
[5]LIU D,LAPATA M,KELLER F.Visual Storytelling with Question-Answer Plans[C]//Findings of the Association for Computational Linguistics:EMNLP 2023.ACL,2023:5800-5813.
[6]CHENG S,GUO Z,WU J,et al.EgoThink:Evaluating First-Person Perspective Thinking Capability of Vision-Language Models[C]//Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR).2024:14291-14302.
[7]HUANG T H,FERRARO F,MOSTAFAZADEH N,et al.Visual storytelling[C]//Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.2016:1233-1239.
[8]KIM T,HEO M O,SON S,et al.GLAC Net:GLocal Attention Cascading Networks for Multi-image Cued Story Generation[J].arXiv:1805.10973,2018.
[9]WANG J,FU J,TANG J,et al.Show,Reward and Tell:Automatic Generation of Narrative Paragraph From Photo Stream by Adversarial Training[C]//AAAI Conference on Artificial Intelligence.2018:7396-7403.
[10]CHEN H,HUANG Y,TAKAMURA H,et al.Commonsense knowledge aware concept selection for diverse and informative visual storytelling[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2021:999-1008.
[11]HSU C Y,CHU Y W,HUANG T H K,et al.Plot and Rework:Modeling Storylines for Visual Storytelling[C]//Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and 11th International Joint Conference on Natural Language Processing.2021:4443-4453.
[12]NARAYAN S,MAYNEZ J,AMPLAYO R K,et al.Conditional generation with a Question-Answering Blueprint[J].Transactions of the Association for Computational Linguistics,2023,11:974-996.
[13]REN S Q,HE K M,GIRSHICK R,et al.Faster R-CNN:Towards Real-Time Object Detection with Region Proposal Networks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2017,39(6):1137-1149.
[14]LI Z,YANG B,LIU Q,et al.Monkey:Image Resolution and Text Label Are Important Things for Large Multi-modal Models[C]//Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR).2024:26753-26763.
[15]PAPINENI K,ROUKOS S,WARD T,et al.BLEU:A method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.2002:311-318.
[16]BANERJEE S,LAVIE A.METEOR:An automatic metric for MT evaluation with improved correlation with human judgments[C]//Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.2005:65-72.
[17]LIN C Y,OCH F J.Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics[C]//Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics.2004:21-26.
[18]VEDANTAM R,ZITNICK C L,PARIKH D.CIDEr:Consensus-based image description evaluation[C]//IEEE Conference on Computer Vision and Pattern Recognition(CVPR).2015:4566-4575.
[19]LI J,GALLEY M,BROCKETT C,et al.A Diversity-Promoting Objective Function for Neural Conversation Models[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.2016:110-119.
[20]CONNORS L H,LIM A,PROKAEVA T,et al.Tabulation of human transthyretin(TTR)variants[J].Amyloid,2003,10(3):160-184.
[21]WANG X,CHEN W,WANG Y F,et al.No metrics are perfect:Adversarial reward learning for visual storytelling[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.ACL,2018:899-909.
[22]WANG E,HAN S C,POON J.SCO-VIST:Social Interaction Commonsense Knowledge-based Visual Storytelling[C]//Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics.2024:1602-1616.