A Survey of the Development of Image-to-Sentence Conversion Methods Based on Deep Neural Networks

doi:10.11896/j.issn.1002-137X.2018.03.004

Abstract

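Abstract (translated from the Chinese): In the current era of big data, images have become an important source of information for the public because of their rich semantics. Image semantic analysis based on deep models is a technique that uses a deep model to convert image content into intuitively understandable semantic knowledge, and it has attracted wide attention from researchers in China and abroad. According to the semantic level of the generated target, the technique can be divided into three categories: single-class, multi-label, and sentence. This paper first introduces the structural characteristics of the deep models corresponding to these three categories of methods, and compares their technical characteristics and current state of development from the perspective of how the technology has evolved. It then focuses on image-to-sentence conversion methods: it discusses their state of development and the differences in their application scenarios and performance requirements, breaks the methods down into their constituent steps, and gives a detailed comparison of academia and industry that identifies their different research emphases and corresponding states of development. Finally, it summarizes image-to-sentence conversion methods based on deep models and offers an outlook, pointing out their current problems and development trends.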

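Keywords (translated from the Chinese): deep model, image semantic analysis, convolutional neural network, recurrent neural network, support vector machine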

Abstract: In the context of big data, the number of images is increasing rapidly, and knowledge acquisition is of great significance to the use and analysis of images. Image semantic analysis based on deep models is a technique that converts image content into intuitively understandable semantic knowledge through a deep model, and it has attracted wide attention at home and abroad. By target, image semantic analysis methods can be divided into phrases, multiple tags, and statements. This paper introduces the research status of these methods and their advantages, and analyzes the features of images during knowledge acquisition and the existing problems, including the structural features of the convolutional neural network and the recurrent neural network. From aspects such as model structure and connection, it analyzes research hotspots and cases, then analyzes the differences between academia and industry, using image-to-sentence conversion to carry out a discriminative comparison. Finally, it draws conclusions and gives an outlook for image semantic analysis methods based on deep models.

Key words: Deep model, Image semantic analysis, Convolutional neural network, Recurrent neural network, Support vector machine

毛典辉, 薛子育, 李子沁, 王帆. A Survey of the Development of Image-to-Sentence Conversion Methods Based on Deep Neural Networks[J]. 计算机科学 (Computer Science), 2018, 45(3): 23-28. https://doi.org/10.11896/j.issn.1002-137X.2018.03.004

MAO Dian-hui, XUE Zi-yu, LI Zi-qin and WANG Fan. Survey on Converting Image to Sentence Based on Depth Neural Networks[J]. Computer Science, 2018, 45(3): 23-28. https://doi.org/10.11896/j.issn.1002-137X.2018.03.004

