Computer Science ›› 2022, Vol. 49 ›› Issue (6): 180-186.doi: 10.11896/jsjkx.211100129

• Computer Graphics & Multimedia •

Stylized Image Captioning Model Based on Disentangle-Retrieve-Generate

CHEN Zhang-hui, XIONG Yun   

  1. School of Computer Science,Fudan University,Shanghai 200433,China
  2. Shanghai Key Laboratory of Data Science,Shanghai 200433,China
  • Received:2021-11-12 Revised:2022-02-23 Online:2022-06-15 Published:2022-06-08
  • About author:CHEN Zhang-hui,born in 1994,postgraduate.His main research interests include big data and data mining.
    XIONG Yun,born in 1980,Ph.D,professor,Ph.D supervisor,is a member of China Computer Federation.Her main research interests include data science and data mining.

Abstract: Image captioning aims to generate a description for an input image that accurately describes its content.Stylized image captioning goes a step further by also taking language style into account:the generated text must appropriately express a specific language style,which makes the output more diverse.To better incorporate style factors into the description text,a stylized image captioning model based on the disentangle-retrieve-generate framework is proposed.The model first splits the sentences in a stylized corpus into content and style parts and builds a content-style memory module;it then retrieves an appropriate style part from the memory module according to the factual caption of the image.Finally,the factual caption and the retrieved style part are fed into a language model to generate the stylized caption.Experimental results on real datasets show that,compared with existing methods,the proposed model achieves better performance on various evaluation metrics and can accurately describe the image content while expressing a specific style.
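The disentangle-retrieve-generate pipeline described above can be sketched in a few steps. The following is a minimal toy illustration, not the paper's implementation: the style-marker vocabulary, the bag-of-words overlap used for retrieval, and the `generate` stand-in for the language model are all illustrative assumptions.

```python
# Toy sketch of the disentangle-retrieve-generate pipeline.
# A hypothetical list of style-bearing words; the paper's actual
# disentanglement is learned, not a fixed vocabulary.
STYLE_MARKERS = {"lovely", "adorable", "sadly", "happily"}

def disentangle(sentence):
    """Split a stylized sentence into a content part and a style part."""
    words = sentence.lower().split()
    content = [w for w in words if w not in STYLE_MARKERS]
    style = [w for w in words if w in STYLE_MARKERS]
    return " ".join(content), " ".join(style)

def build_memory(stylized_corpus):
    """Content-style memory module: a list of (content, style) pairs."""
    return [disentangle(s) for s in stylized_corpus]

def retrieve_style(factual_caption, memory):
    """Retrieve the style part whose content is most similar to the
    factual caption (here: simple word-overlap similarity)."""
    query = set(factual_caption.lower().split())
    def overlap(entry):
        content, _ = entry
        return len(query & set(content.split()))
    _, style = max(memory, key=overlap)
    return style

def generate(factual_caption, style_part):
    """Stand-in for the language model that fuses caption and style."""
    return f"{factual_caption} [{style_part}]"

corpus = ["a lovely dog runs on the grass",
          "sadly the old house stands alone"]
memory = build_memory(corpus)
style = retrieve_style("a dog plays in the park", memory)
caption = generate("a dog plays in the park", style)
```

In the paper the retrieval step uses learned sentence representations rather than word overlap, and the final step is a trained language model conditioned on both the factual caption and the retrieved style part; the sketch only mirrors the data flow.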

Key words: Deep learning, Encoder-decoder, Image captioning, Language style, Text generation

CLC Number: TP181