Computer Science (计算机科学), 2022, Vol. 49, Issue (6): 180-186. doi: 10.11896/jsjkx.211100129

• Computer Graphics & Multimedia •

Stylized Image Captioning Model Based on Disentangle-Retrieve-Generate

CHEN Zhang-hui, XIONG Yun

  1. School of Computer Science, Fudan University, Shanghai 200433, China
    Shanghai Key Laboratory of Data Science, Shanghai 200433, China
  • Received: 2021-11-12 Revised: 2022-02-23 Published: 2022-06-15 Online: 2022-06-08
  • Corresponding author: XIONG Yun (yunx@fudan.edu.cn)
  • About author: (18210240004@fudan.edu.cn)
    CHEN Zhang-hui, born in 1994, postgraduate. His main research interests include big data and data mining.
    XIONG Yun, born in 1980, Ph.D., professor, Ph.D. supervisor, is a member of China Computer Federation. Her main research interests include data science and data mining.

Abstract: Image captioning aims to generate text that accurately describes the content of an input image. Stylized image captioning goes a step further by also taking language style into account: the caption must appropriately express a specific language style, which makes the generated text more diverse. To better incorporate style elements into the generated captions, a stylized image captioning model based on a disentangle-retrieve-generate framework is proposed. The model first splits the sentences in the stylized corpus into content words and style words and constructs a content-style memory module. It then retrieves from the memory module the style words that match the factual caption of the image. Finally, the factual caption and the retrieved style words are fed into a language model to generate the stylized caption. Experimental results on real datasets show that, compared with existing methods, the proposed model performs better across the evaluation metrics and can accurately describe the image content while expressing a specific style.

Key words: Deep learning, Encoder-decoder, Image captioning, Language style, Text generation
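
The three stages named in the abstract (disentangle, retrieve, generate) can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the authors' implementation: the abstract does not say how style words are identified, how matching is scored, or which language model is used. Here a hypothetical hand-written lexicon (`STYLE_WORDS`) stands in for the disentangling step, Jaccard overlap of content words stands in for retrieval, and `build_generator_input` only assembles a hypothetical input string that a language model could be conditioned on. All names are illustrative.

```python
from dataclasses import dataclass

# Hypothetical style lexicon (assumption): the source does not specify
# how style words are separated from content words.
STYLE_WORDS = {"lovely", "adorable", "happily", "beautiful", "joyfully"}


@dataclass(frozen=True)
class MemoryEntry:
    content: frozenset  # content words of one stylized sentence
    style: tuple        # style words of the same sentence, in order


def disentangle(sentence):
    """Split a sentence into (content words, style words) via the toy lexicon."""
    tokens = sentence.lower().split()
    content = [t for t in tokens if t not in STYLE_WORDS]
    style = [t for t in tokens if t in STYLE_WORDS]
    return content, style


def build_memory(stylized_corpus):
    """Build the content-style memory: one entry per stylized sentence."""
    memory = []
    for sentence in stylized_corpus:
        content, style = disentangle(sentence)
        if style:  # a sentence without style words offers nothing to retrieve
            memory.append(MemoryEntry(frozenset(content), tuple(style)))
    return memory


def jaccard(a, b):
    """Jaccard overlap of two word sets (assumed matching score)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(factual_caption, memory):
    """Return the style words of the entry whose content best matches the caption."""
    query = frozenset(factual_caption.lower().split())
    best = max(memory, key=lambda e: jaccard(query, e.content), default=None)
    return best.style if best else ()


def build_generator_input(factual_caption, style_words):
    """Assemble the text a language model would be conditioned on (hypothetical format)."""
    return f"{factual_caption} [STYLE] {' '.join(style_words)}"


corpus = [
    "a lovely dog runs happily on the grass",
    "a beautiful sunset over the sea",
]
memory = build_memory(corpus)
caption = "a dog runs on the grass"
print(build_generator_input(caption, retrieve(caption, memory)))
# prints: a dog runs on the grass [STYLE] lovely happily
```

Keeping the memory as explicit (content, style) pairs separates the question of which style fits a given content from the final wording, which is left to the language model. A learned sentence encoder could replace the word-overlap score without changing this structure.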

CLC Number: 

  • TP181