Computer Science ›› 2020, Vol. 47 ›› Issue (12): 149-160. doi: 10.11896/jsjkx.200500039


Survey of Image Captioning Methods

MIAO Yi1, ZHAO Zeng-shun1,2,3, YANG Yu-lu1, XU Ning1, YANG Hao-ran1, SUN Qian1   

  1. College of Electronic and Information Engineering, Shandong University of Science and Technology, Qingdao, Shandong 266590, China
  2. School of Control Science and Engineering, Shandong University, Jinan 250061, China
  3. Department of Electrical & Computer Engineering, University of Florida, Gainesville, Florida 32611, USA
  • Received: 2020-05-11  Revised: 2020-08-13  Online: 2020-12-15  Published: 2020-12-17
  • About author: MIAO Yi, born in 1996, postgraduate. His main research interests include image processing and analysis.
    ZHAO Zeng-shun, born in 1975, Ph.D, associate professor, Ph.D supervisor. His main research interests include computer vision, intelligent robots and machine learning.
  • Supported by:
    National Natural Science Foundation of China (61403281), China Postdoctoral Science Foundation (2015T80717) and Natural Science Foundation of Shandong Province, China (ZR2014FM002).

Abstract: Image captioning is a task that takes an image as input and generates a natural language description of that image through modeling and computation, giving computers the ability to "talk about pictures". It is a newer computer vision task that follows image recognition, image segmentation and object tracking. This paper traces the development of image captioning and gives a detailed survey of template-based, retrieval-based and deep learning-based captioning methods. It focuses in particular on the deep learning-based methods and discusses the experimental results of each. The evaluation metrics and the common datasets used in this field are introduced in detail. Finally, this paper points out open problems and directions for future research.
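Among the evaluation metrics the survey covers, BLEU (Papineni et al., 2002) is the most widely reported for captioning. As an illustrative sketch (not code from this paper), the following pure-Python snippet shows the two ingredients of sentence-level BLEU: clipped (modified) n-gram precision against multiple reference captions, and a brevity penalty for short candidates.

```python
from collections import Counter
import math

def modified_ngram_precision(candidate, references, n):
    """Clipped n-gram precision: candidate n-gram counts are capped by the
    maximum count of that n-gram in any single reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    max_ref = Counter()
    for ref in references:
        ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        for gram, count in ref_counts.items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and a brevity penalty
    based on the closest-length reference."""
    precisions = [modified_ngram_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:          # any zero precision collapses the geometric mean
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    closest = min(references, key=lambda r: abs(len(r) - len(candidate)))
    bp = 1.0 if len(candidate) > len(closest) else math.exp(1 - len(closest) / len(candidate))
    return bp * math.exp(log_avg)

cand = "a man rides a horse on the beach".split()
refs = ["a man is riding a horse on the beach".split(),
        "a person rides a horse along the shore".split()]
print(round(bleu(cand, refs, max_n=2), 3))  # BLEU-2 against two references
```

The clipping step is what prevents a degenerate caption such as "the the the the" from scoring well; production toolkits add smoothing for short sentences, which this sketch omits.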

Key words: Computer vision, Deep learning, Image captioning, Image processing, Natural language processing

CLC Number: TP301