Computer Science ›› 2025, Vol. 52 ›› Issue (9): 269-275. doi: 10.11896/jsjkx.240700136

• Computer Graphics & Multimedia •

Visual Storytelling Based on Planning Learning

WANG Yuanlong, ZHANG Ningqian, ZHANG Hu   

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
  • Received: 2024-07-22 Revised: 2024-11-01 Online: 2025-09-15 Published: 2025-09-11
  • About author: WANG Yuanlong, born in 1982, Ph.D., associate professor, is a member of CCF (No. 48432M). His main research interests include natural language processing and graphics and image processing.
  • Supported by:
    National Natural Science Foundation of China (62176145).

Abstract: Visual storytelling is attracting growing interest from researchers in computer vision and natural language processing. Current models concentrate on enhancing image representations, for example by using external knowledge and scene graphs. Although some progress has been made, these models still suffer from repeated content and a lack of detailed description. To address these issues, this paper proposes a visual story generation model that incorporates planning learning. It poses questions along six key dimensions (theme, object, action, place, reasoning, and prediction) and uses a pre-trained visual question answering language model to generate detailed answers. These answers guide the planning and design process, leading to more nuanced visual stories. The model is divided into four stages. The first stage extracts visual information from the images. The second stage extracts and selects relevant concepts through a concept generator. The third stage uses pre-trained language models to guide the generation of planning information. The fourth stage integrates the visual, conceptual, and planning information generated in the preceding three stages to complete the visual story generation task. The model's effectiveness is validated on the VIST dataset, where it outperforms the COVS model in BLEU-1, BLEU-2, ROUGE_L, Distinct-3, Distinct-4, and TTR by 1.58, 2.7, 0.4, 2.2, 3.6, and 5.6 percentage points, respectively.
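For readers unfamiliar with the diversity metrics reported above, the following is a minimal sketch of how Distinct-n and TTR are commonly computed: Distinct-n is the ratio of unique n-grams to total n-grams, and TTR (type-token ratio) is the ratio of unique tokens to total tokens. The function names, the lowercased whitespace tokenization, and the pooling of all stories into one corpus are assumptions made for illustration; the abstract does not specify the evaluation scripts the authors used.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(stories, n):
    """Unique n-grams divided by total n-grams, pooled over all stories.

    Assumption: lowercased whitespace tokenization; a real evaluation
    would likely use a proper tokenizer.
    """
    pooled = []
    for story in stories:
        pooled.extend(ngrams(story.lower().split(), n))
    if not pooled:
        return 0.0
    return len(set(pooled)) / len(pooled)


def type_token_ratio(stories):
    """Unique tokens divided by total tokens, pooled over all stories."""
    tokens = [tok for story in stories for tok in story.lower().split()]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    generated = ["the dog ran and the dog ran"]  # hypothetical model output
    print(distinct_n(generated, 3))
    print(type_token_ratio(generated))
```

Higher values on both measures indicate less repetition, which is why they are a natural fit for the content-reuse problem the abstract describes.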

Key words: Visual storytelling, Planning learning, Visual question answering

CLC Number: 

  • TP391