Computer Science ›› 2022, Vol. 49 ›› Issue (10): 151-158. doi: 10.11896/jsjkx.210900159

• Computer Graphics & Multimedia •

Spatial Encoding and Multi-layer Joint Encoding Enhanced Transformer for Image Captioning

FANG Zhong-jun¹,², ZHANG Jing¹, LI Dong-dong¹,²

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
    2. Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215031, China
  • Received: 2021-09-22  Revised: 2022-03-09  Online: 2022-10-15  Published: 2022-10-13
  • About author: FANG Zhong-jun, born in 1997, postgraduate, is a member of China Computer Federation. His main research interests include computer vision, neural networks and image captioning.
    ZHANG Jing, born in 1978, Ph.D, associate professor, is a member of China Computer Federation. Her main research interests include computer vision, image/video information retrieval, image annotation and so on.
  • Supported by:
    National Natural Science Foundation of China (61806078).

Abstract: Image captioning is one of the hot research topics in computer vision. It is a cross-media data analysis task that combines computer vision and natural language processing: it describes an image by understanding its content and generating captions that are both semantically and grammatically correct. Most existing image captioning methods adopt the encoder-decoder framework, and most of them ignore the relative position relationships between visual objects when extracting visual object features from the image, even though these relationships are important for generating accurate captions. Motivated by this, this paper proposes a spatial encoding and multi-layer joint encoding enhanced Transformer for image captioning. To make better use of the position information contained in the image, a spatial encoding mechanism for visual objects is proposed, which converts the independent spatial position of each visual object into relative spatial relationships between objects, helping the model recognize how the objects are positioned with respect to one another. At the same time, in the visual-object encoder, the top encoding layer retains semantic information that fits the image well but loses part of the image's visual information. Taking this into account, this paper proposes a multi-layer joint encoding mechanism that enriches the semantic information of the top encoding layer by integrating the image feature information contained in each shallow encoding layer, so as to obtain richer semantic features that fit the image. The proposed method is evaluated with multiple metrics (BLEU, METEOR, ROUGE-L, CIDEr, etc.) on the MSCOCO dataset. Ablation experiments show that both the spatial encoding mechanism and the multi-layer joint encoding mechanism help generate more accurate and effective image captions. Comparative experimental results show that the proposed method produces accurate and effective image captions and is superior to most state-of-the-art methods.
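
To make the two proposed mechanisms more concrete, the minimal NumPy sketch below illustrates (1) turning each object's independent bounding box into pairwise relative spatial features that bias the attention logits, and (2) fusing the outputs of all encoder layers into the final representation. Everything here is an illustrative assumption rather than the authors' implementation: the geometry formula follows the relative-geometry style common in Transformer-based captioners, and the dimensions, function names, and fusion weights are invented for the example.

```python
# Illustrative sketch only -- NOT the paper's code. Feature sizes, the
# box-geometry formula, and the fusion scheme are assumptions.
import numpy as np

def relative_spatial_encoding(boxes):
    """Convert independent boxes (x, y, w, h) into pairwise relative
    geometry features between objects (one plausible formulation)."""
    x, y, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    # Pairwise offsets normalized by the query box size; log-scaled.
    dx = np.log(np.abs(x[:, None] - x[None, :]) / w[:, None] + 1e-3)
    dy = np.log(np.abs(y[:, None] - y[None, :]) / h[:, None] + 1e-3)
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)      # (N, N, 4)

def attention_with_spatial_bias(Q, K, V, spatial_bias):
    """Scaled dot-product attention whose logits are shifted by the
    relative spatial encoding, so attention sees where objects sit."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + spatial_bias    # (N, N)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

def multilayer_joint_encoding(layer_outputs, fusion_w):
    """Fuse every encoder layer's output into the final features, so
    shallow visual detail supplements the top layer's semantics."""
    stacked = np.stack(layer_outputs)               # (L, N, d)
    w = np.exp(fusion_w) / np.exp(fusion_w).sum()   # learned in practice
    return np.tensordot(w, stacked, axes=1)         # (N, d)

# Toy usage: 3 detected objects with 8-dim appearance features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 8))
boxes = np.array([[10, 10, 40, 60], [30, 20, 50, 50], [70, 40, 20, 30]], float)
geo = relative_spatial_encoding(boxes)
bias = geo.sum(axis=-1)   # stand-in for a learned projection of geometry
enc1 = attention_with_spatial_bias(feats, feats, feats, bias)
enc2 = attention_with_spatial_bias(enc1, enc1, enc1, bias)
fused = multilayer_joint_encoding([enc1, enc2], np.zeros(2))
print(fused.shape)        # (3, 8)
```

In an actual model the spatial bias and the fusion weights would be learned projections trained end to end with the captioning loss; the fixed values above only demonstrate the data flow through the two mechanisms.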

Key words: Image captioning, Transformer, Spatial encoding mechanism, Multi-layer joint encoding mechanism, Attention mechanism

CLC Number: TP183