Computer Science ›› 2022, Vol. 49 ›› Issue (10): 191-197.doi: 10.11896/jsjkx.220600009

• Computer Graphics & Multimedia •

Cross-scale Feature Fusion Self-attention for Image Captioning

WANG Ming-zhan, JI Jun-zhong, JIA Ao-zhe, ZHANG Xiao-dan   

  1. School of Computer Science, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
  2. Beijing Institute of Artificial Intelligence, Beijing University of Technology, Beijing 100124, China
  • Received: 2022-06-01  Revised: 2022-07-05  Online: 2022-10-15  Published: 2022-10-13
  • About author: WANG Ming-zhan, born in 1997, master. His main research interests include image captioning.
    ZHANG Xiao-dan, born in 1987, Ph.D. Her main research interests include image captioning, image processing, computer vision and natural language processing.
  • Supported by: National Natural Science Foundation of China (61906007) and the R&D Program of Beijing Municipal Education Commission (KM202110005022, KZ202210005009).

Abstract: In recent years, the encoder-decoder framework based on the self-attention mechanism has become the mainstream model in image captioning. However, self-attention in the encoder only models the visual relations of low-scale features and ignores effective information carried by high-scale visual features, which degrades the quality of the generated descriptions. To solve this problem, this paper proposes a cross-scale feature fusion self-attention (CFFSA) method for image captioning. Specifically, CFFSA integrates low-scale and high-scale visual features within self-attention to broaden the range of attention from a visual perspective, which increases effective visual information and reduces noise, thereby learning more accurate visual and semantic relationships. Experiments on the MS COCO dataset show that the proposed method captures the relationships between cross-scale visual features more accurately and generates more accurate descriptions. In addition, CFFSA is a general method that can further improve performance when combined with other self-attention based image captioning models.
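
To make the cross-scale fusion idea concrete, below is a minimal sketch of a cross-scale feature fusion self-attention layer, assuming PyTorch. It illustrates the general technique the abstract describes, not the paper's CFFSA implementation: the module name, the use of low-scale region features as queries, the linear projection of high-scale features, and the concatenation-based key/value fusion are all assumptions made for illustration.

```python
# Illustrative sketch of cross-scale feature fusion self-attention (not the
# paper's exact CFFSA module). Low-scale features serve as queries, while the
# keys/values come from a fusion of low- and high-scale features, so attention
# can draw on visual information from both scales.
import torch
import torch.nn as nn

class CrossScaleFusionAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Project high-scale features into the low-scale feature space before
        # fusion (a plausible design choice; the paper's layer may differ).
        self.proj_high = nn.Linear(d_model, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # low:  (B, N_low, d_model),  e.g. region features from Faster R-CNN
        # high: (B, N_high, d_model), e.g. grid features from a deeper stage
        fused = torch.cat([low, self.proj_high(high)], dim=1)  # cross-scale key/value set
        out, _ = self.attn(query=low, key=fused, value=fused)
        return self.norm(low + out)  # residual + LayerNorm, as in Transformer encoders

if __name__ == "__main__":
    layer = CrossScaleFusionAttention()
    low = torch.randn(2, 36, 512)   # 36 region features per image
    high = torch.randn(2, 49, 512)  # 7x7 grid of high-scale features
    print(layer(low, high).shape)   # torch.Size([2, 36, 512])
```

A natural design question is where to fuse: this sketch fuses at the key/value set so that low-scale queries can attend across both scales; fusing on the query side, or gating the two scales instead of concatenating them, would be equally plausible variants.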

Key words: Image captioning, Self-attention, Cross-scale feature fusion

CLC Number: TP181