Computer Science ›› 2020, Vol. 47 ›› Issue (12): 183-189.doi: 10.11896/jsjkx.190900181


Study on Joint Generation of Bilingual Image Captions

ZHANG Kai, LI Jun-hui, ZHOU Guo-dong   

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Received: 2019-09-26  Revised: 2020-03-07  Published: 2020-12-17
  • About author: ZHANG Kai, born in 1992, graduate student, is a member of China Computer Federation. His main research interests include natural language processing, machine translation and image captioning.
    LI Jun-hui, born in 1983, associate professor. His main research interests include natural language processing and machine translation.
  • Supported by:
    National Natural Science Foundation of China(61876120).

Abstract: Most research on image captioning generates a caption in a single language from an image, but as languages around the world increasingly converge, it is often necessary to generate captions in two or even more languages for one image, so that native speakers of each language can understand what others say about it. This paper therefore proposes an approach to the joint generation of bilingual image captions, i.e., generating two captions in two languages for one image. The architecture consists of an encoder and two decoders: the encoder uses a convolutional neural network to extract image features, while the decoders adopt Long Short-Term Memory (LSTM) networks. Motivated by the fact that the two captions of an image are semantically equivalent, a joint model is proposed to generate bilingual image captions. Specifically, the two decoders generate image captions in an alternating manner, so that the decoding history of both languages is available when predicting the next word. Experimental results on the MSCOCO2014 data set show that joint generation of bilingual image captions improves the performance of both languages at the same time. Compared with English single-language image captioning, BLEU_4 increases by 1.0 and CIDEr increases by 0.98; compared with Japanese single-language image captioning, BLEU_4 increases by 1.0 and CIDEr increases by 0.31.

Key words: Alternating generation, Bilingual image captions, Joint model

CLC Number: TP391.1