Computer Science ›› 2018, Vol. 45 ›› Issue (3): 23-28. doi: 10.11896/j.issn.1002-137X.2018.03.004

Survey on Converting Image to Sentence Based on Deep Neural Networks

MAO Dian-hui, XUE Zi-yu, LI Zi-qin and WANG Fan   

  • Online: 2018-03-15 Published: 2018-11-13

Abstract: In the era of big data, the number of images is growing rapidly, and knowledge acquisition is of great significance to the use and analysis of images. Image semantic analysis based on deep models converts image content into intuitive, understandable semantic knowledge, and has attracted wide attention at home and abroad. By output target, image semantic analysis methods can be divided into those that produce phrases, multiple tags, and sentences. This paper introduces the research status and advantages of these methods, and analyzes the image features used during knowledge acquisition and the existing problems, including the structural features of convolutional neural networks and recurrent neural networks. From aspects such as model structure and connection, the paper analyzes research hotspots and cases, examines the differences between academia and industry, and uses image-to-sentence conversion to carry out a discriminative comparison. Finally, the paper draws conclusions and discusses the outlook for image semantic analysis methods based on deep models.

Key words: Deep model, Image semantic analysis, Convolutional neural network, Recurrent neural network, Support vector machine
