Computer Science ›› 2020, Vol. 47 ›› Issue (12): 183-189. doi: 10.11896/jsjkx.190900181

• Computer Graphics & Multimedia •

Study on Joint Generation of Bilingual Image Captions

ZHANG Kai, LI Jun-hui, ZHOU Guo-dong

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Received:2019-09-26 Revised:2020-03-07 Published:2020-12-17
  • Corresponding author: LI Jun-hui (jhli@suda.edu.cn)
  • About author: suda_zk@163.com
  • Supported by:
    National Natural Science Foundation of China(61876120)

Study on Joint Generation of Bilingual Image Captions

ZHANG Kai, LI Jun-hui, ZHOU Guo-dong   

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Received:2019-09-26 Revised:2020-03-07 Published:2020-12-17
  • About author:ZHANG Kai, born in 1992, graduate student, is a member of China Computer Federation. His main research interests include natural language processing, machine translation and image captioning.
    LI Jun-hui, born in 1983, associate professor. His main research interests include natural language processing and machine translation.
  • Supported by:
    National Natural Science Foundation of China(61876120).

Abstract: Most research on image captioning generates a caption in a single language for a given image. As the languages of different countries increasingly meet and mix, generating captions in two or even more languages for one image is an inevitable trend, so that people with different native languages can understand what others say about the same picture. To this end, a bilingual image captioning method, i.e., generating captions in two languages for an image at the same time, is proposed. The method consists of one encoder and two different decoders: the encoder, based on a convolutional neural network, extracts image features, while the two decoders, based on long short-term memory (LSTM) networks, decode the features into the two languages. Since the two captions of an image are translations of each other, a joint model for bilingual image caption generation is further proposed. Specifically, on the decoding side the two captions are generated in an alternating manner, so that when predicting the next word of one language, the model can exploit not only the decoding history of that language but also that of the other language, improving the captioning performance of both languages simultaneously. Experimental results on the MSCOCO2014 dataset show that joint bilingual caption generation improves the performance of both languages: on English, BLEU_4 increases by 1.0 and CIDEr by 0.98 over monolingual English captioning; on Japanese, BLEU_4 increases by 1.0 and CIDEr by 0.31 over monolingual Japanese captioning.

Key words: Alternating generation, Joint model, Bilingual image captions

Abstract: Most of the research on image captioning generates a caption in a single language for an image, but in the context of the increasing convergence of languages across countries, it is necessary to generate captions in two or even more languages for one image, so that speakers of different native languages can understand what others say about the same picture. This paper therefore proposes an approach to generating bilingual image captions, i.e., producing two captions in two different languages for an image. The architecture consists of an encoder and two decoders, in which the encoder uses a convolutional neural network to extract image features while the decoders adopt Long Short-Term Memory networks. Motivated by the fact that the two captions of an image are semantically equivalent, we propose a joint model to generate bilingual image captions. Specifically, the two decoders generate their captions in an alternating way, so that the decoding history of both languages is available when predicting the next word, which benefits the caption generation of both languages at the same time. Experimental results on the MSCOCO2014 dataset show that joint generation of bilingual image captions improves the performance of both languages simultaneously: compared with monolingual English captioning, BLEU_4 increases by 1.0 and CIDEr increases by 0.98; compared with monolingual Japanese captioning, BLEU_4 increases by 1.0 and CIDEr increases by 0.31.
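The following is a minimal PyTorch-style sketch, written for illustration only, of the joint decoding scheme described in the abstract: a pooled CNN image feature initializes two LSTM decoders (one per language) that take alternating steps, and each step additionally conditions on the other decoder's current hidden state, so that both decoding histories inform every next-word prediction. The class name JointBilingualDecoder, the dimensions, the vocabulary sizes and the concatenation-based fusion of the two histories are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn


class JointBilingualDecoder(nn.Module):
    """Illustrative joint decoder: two LSTMs that alternate steps and share history."""

    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512,
                 en_vocab=10000, ja_vocab=10000):
        super().__init__()
        # the pooled CNN image feature initializes both decoders' states
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.init_c = nn.Linear(feat_dim, hidden_dim)
        self.embed_en = nn.Embedding(en_vocab, embed_dim)
        self.embed_ja = nn.Embedding(ja_vocab, embed_dim)
        # each step also receives the other language's current hidden state
        self.lstm_en = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.lstm_ja = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.out_en = nn.Linear(hidden_dim, en_vocab)
        self.out_ja = nn.Linear(hidden_dim, ja_vocab)

    def forward(self, img_feat, en_tokens, ja_tokens):
        # img_feat: (B, feat_dim) CNN feature of the image
        # en_tokens / ja_tokens: (B, T) gold captions used for teacher forcing
        h_en, c_en = self.init_h(img_feat), self.init_c(img_feat)
        h_ja, c_ja = self.init_h(img_feat), self.init_c(img_feat)
        en_logits, ja_logits = [], []
        for t in range(max(en_tokens.size(1), ja_tokens.size(1))):
            if t < en_tokens.size(1):
                # English step, conditioned on the Japanese decoding history
                x = torch.cat([self.embed_en(en_tokens[:, t]), h_ja], dim=-1)
                h_en, c_en = self.lstm_en(x, (h_en, c_en))
                en_logits.append(self.out_en(h_en))
            if t < ja_tokens.size(1):
                # Japanese step, conditioned on the updated English history
                x = torch.cat([self.embed_ja(ja_tokens[:, t]), h_en], dim=-1)
                h_ja, c_ja = self.lstm_ja(x, (h_ja, c_ja))
                ja_logits.append(self.out_ja(h_ja))
        return torch.stack(en_logits, dim=1), torch.stack(ja_logits, dim=1)

At training time the two output sequences would be scored with cross-entropy against the gold English and Japanese captions; at test time the same alternation can be run greedily or with beam search, feeding each decoder its own previous predictions instead of gold tokens.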

Key words: Alternating generation, Bilingual image captions, Joint model
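As a side note on the BLEU_4 scores reported in the abstract, the snippet below is a hedged, self-contained example of computing a corpus-level BLEU_4 score for generated captions with NLTK; the captions are made up for illustration, and the paper itself presumably relies on the standard MSCOCO caption evaluation toolkit rather than this code.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# each hypothesis caption is paired with one or more tokenized reference captions
references = [
    [["a", "man", "riding", "a", "horse", "on", "a", "beach"],
     ["a", "person", "rides", "a", "horse", "along", "the", "shore"]],
]
hypotheses = [["a", "man", "rides", "a", "horse", "on", "the", "beach"]]

# BLEU_4: uniform weights over 1- to 4-gram precisions (the NLTK default)
bleu_4 = corpus_bleu(references, hypotheses,
                     weights=(0.25, 0.25, 0.25, 0.25),
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU_4 = {bleu_4:.4f}")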

CLC Number:

  • TP391.1
[1] FARHADI A,HEJRATI M,SADEGHI M A,et al.Every Picture Tells a Story:Generating Sentences from Images[C]//Proceedings Part IV of the 11th European Conference on Computer Vision.Heraklion,Crete,Greece:Springer,2010:15-29.
[2] KULKARNI G,PREMRAJ V,ORDONEZ V,et al.Babytalk:Understanding and generating simple image descriptions[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2013,35(12):2891-2903.
[3] VINYALS O,TOSHEV A,BENGIO S,et al.Show and tell:A neural image caption generator[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Boston,MA,USA:IEEE,2015:3156-3164.
[4] KARPATHY A,LI F F.Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2015:3128-3137.
[5] MAO J H,XU W,YANG Y,et al.Deep captioning with multimodal recurrent neural networks (m-rnn)[J].arXiv:1412.6632.
[6] XU J,GAVVES E,FERNANDO B,et al.Guiding the long-short term memory model for image caption generation[C]//Proceedings of the IEEE International Conference on Computer Vision.2015:2407-2415.
[7] WU Q,SHEN C H,LIU L Q,et al.What value do explicit high level concepts have in vision to language problems?[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:203-212.
[8] XU K,BA J,KIROS R,et al.Show,Attend and Tell:Neural Image Caption Generation with Visual Attention[C]//Proceedings of the 32nd International Conference on Machine Learning.Lille,France:JMLR.org,2015:2048-2057.
[9] LU J S,XIONG C M,PARIKH D,et al.Knowing when to look:Adaptive attention via a visual sentinel for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:375-383.
[10] CHEN L,ZHANG H W,XIAO J,et al.Sca-cnn:Spatial and channel-wise attention in convolutional networks for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:5659-5667.
[11] LI X R,LAN W Y,DONG J F,et al.Adding Chinese Captions to Images[C]//Proceedings of the 2016 Association for Computing Machinery(ACM) on International Conference on Multimedia Retrieval.New York,USA:ACM,2016:271-275.
[12] SZEGEDY C,LIU W,JIA Y Q,et al.Going deeper with convolutions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Boston,MA,USA:IEEE,2015:1-9.
[13] RENNIE S J,MARCHERET E,MROUEH Y,et al.Self-critical sequence training for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:7008-7024.
[14] ANDERSON P,HE X D,BUEHLER C,et al.Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:6077-6086.
[15] DOGNIN P L,MELNYK I,MROUEH Y,et al.Adversarial Semantic Alignment for Improved Image Captions[J].arXiv:1805.00063v3.
[16] BITEN A F,GOMEZ L,RUSIÑOL M,et al.Good News,Everyone! Context driven entity-aware captioning for news images[J].arXiv:1904.01475.
[17] KIM D J,CHOI J,OH T H,et al.Dense Relational Captioning:Triple-Stream Networks for Relationship-Based Captioning[J].arXiv:1903.05942v3.
[18] MITRA S,AVRA L J,MCCLUSKEY E J,et al.Scan synthesis for one-hot signals[C]//Proceedings International Test Confe-rence.IEEE,1997:714-722.
[19] WERLEN L M,PAPPAS N,RAM D,et al.Self-attentive residual decoder for neural machine translation[J].arXiv:1709.04849.
[20] LIN T Y,MAIRE M,BELONGIE S,et al.Microsoft coco:Common objects in context[C]//European conference on computer vision.Cham:Springer,2014:740-755.
[21] YOSHIKAWA Y,SHIGETO Y,TAKEUCHI A.STAIR Captions:Constructing a large-scale Japanese image caption dataset[J].arXiv:1705.00823.
[22] PAPINENI K,ROUKOS S,WARD T,et al.BLEU:a Method for Automatic Evaluation of Machine Translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.Philadelphia,PA,USA:ACL,2002:311-318.
[23] DENKOWSKI M,LAVIE A.Meteor universal:Language specific translation evaluation for any target language[C]//Proceedings of the Ninth Workshop on Statistical Machine Translation.2014:376-380.
[24] LIN C Y.Rouge:A package for automatic evaluation of summaries[C]//Post-Conference Workshop of ACL 2004.2004.
[25] VEDANTAM R,ZITNICK C L,PARIKH D,et al.CIDEr:Consensus-based image description evaluation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2015:4566-4575.
[26] ANDERSON P,FERNANDO B,JOHNSON M,et al.Spice:Semantic propositional image caption evaluation[C]//European Conference on Computer Vision.Cham:Springer,2016:382-398.
[27] HE K M,ZHANG X Y,REN S Q,et al.Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:770-778.
[28] RUSSAKOVSKY O,DENG J,SU H,et al.Imagenet large scale visual recognition challenge[J].International Journal of Computer Vision,2015,115(3):211-252.
[29] KINGMA D P,BA J.Adam:A method for stochastic optimization[J].arXiv:1412.6980.
[30] WISEMAN S,RUSH A M.Sequence-to-sequence learning as beam-search optimization[J].arXiv:1606.02960.
[31] IOFFE S,SZEGEDY C.Batch Normalization:Accelerating Deep Network Training by Reducing Internal Covariate Shift[C]//Proceedings of the 32nd International Conference on Machine Learning.Lille,France:JMLR.org,2015:448-456.
[32] VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Advances in Neural Information Processing Systems.2017:5998-6008.