Computer Science ›› 2020, Vol. 47 ›› Issue (12): 149-160. doi: 10.11896/jsjkx.200500039

• Computer Graphics and Multimedia •

  • Corresponding author: ZHAO Zeng-shun (zhaozengshun@163.com)
  • About author: 617544375@qq.com

Survey of Image Captioning Methods

MIAO Yi1, ZHAO Zeng-shun1,2,3, YANG Yu-lu1, XU Ning1, YANG Hao-ran1, SUN Qian1   

  1. College of Electronic and Information Engineering, Shandong University of Science and Technology, Qingdao, Shandong 266590, China
    2. School of Control Science and Engineering, Shandong University, Jinan 250061, China
    3. Department of Electrical & Computer Engineering, University of Florida, Gainesville, Florida 32611, USA
  • Received:2020-05-11 Revised:2020-08-13 Online:2020-12-15 Published:2020-12-17
  • About author: MIAO Yi, born in 1996, postgraduate. His main research interests include image processing and analysis.
    ZHAO Zeng-shun, born in 1975, Ph.D, associate professor, Ph.D supervisor. His main research interests include computer vision, intelligent robots and machine learning.
  • Supported by:
    National Natural Science Foundation of China (61403281), China Postdoctoral Science Foundation (2015T80717) and Natural Science Foundation of Shandong Province, China (ZR2014FM002).


Abstract: Image captioning is a task that takes an image as input and, through modeling and computation, generates a natural language description of that image, so that computers gain the ability to "talk about pictures". It is another new computer vision task after image recognition, image segmentation and target tracking. This paper takes the development of image captioning as its main thread and gives a detailed survey of image captioning methods based on templates, retrieval and deep learning. It focuses in particular on the deep learning-based methods and discusses the experimental results of the various approaches. The evaluation metrics and the common datasets used in this field are introduced in detail. Finally, this paper points out the remaining problems and future research directions.
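The evaluation metrics the survey introduces center on scores such as BLEU, which rates a candidate caption by the geometric mean of its clipped n-gram precisions against reference captions, multiplied by a brevity penalty for overly short candidates. A minimal single-sentence sketch (an illustrative simplification without smoothing, not the official implementation; the function names are my own):

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty.
    candidate: list of tokens; references: list of token lists."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if clipped == 0 or total == 0:
            return 0.0  # this toy version applies no smoothing
        log_precision_sum += math.log(clipped / total) / max_n
    # Brevity penalty, computed against the reference closest in length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_precision_sum)
```

For example, a candidate identical to its reference scores 1.0, while a correct but truncated prefix keeps perfect n-gram precision yet is scaled down by the brevity penalty, which is the behavior the metric was designed for.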

Key words: Image processing, Image captioning, Deep learning, Computer vision, Natural language processing

CLC number: TP301