Computer Science, 2022, Vol. 49, Issue (10): 191-197. doi: 10.11896/jsjkx.220600009

• Computer Graphics & Multimedia •

Cross-scale Feature Fusion Self-attention for Image Captioning

WANG Ming-zhan, JI Jun-zhong, JIA Ao-zhe, ZHANG Xiao-dan

  1. School of Computer Science, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
     Beijing Institute of Artificial Intelligence, Beijing University of Technology, Beijing 100124, China
  • Received: 2022-06-01  Revised: 2022-07-05  Online: 2022-10-15  Published: 2022-10-13
  • Corresponding author: ZHANG Xiao-dan (zhangxiaodan@bjut.edu.cn)
  • About author: WANG Ming-zhan (erictyloo@163.com), born in 1997, master. His main research interests include image captioning.
    ZHANG Xiao-dan, born in 1987, Ph.D. Her main research interests include image captioning, image processing, computer vision and natural language processing.
  • Supported by:
    National Natural Science Foundation of China (61906007) and R&D Program of Beijing Municipal Education Commission (KM202110005022, KZ202210005009).

Abstract: In recent years, the encoder-decoder framework based on the self-attention mechanism has become the mainstream model in image captioning. However, self-attention in the encoder models only the visual relations among low-scale features, ignoring effective information carried by high-scale visual features, which degrades the quality of the generated captions. To address this problem, this paper proposes a cross-scale feature fusion self-attention (CFFSA) method for image captioning. Specifically, CFFSA fuses low-scale and high-scale visual features within the self-attention computation, widening the range of attention from a visual perspective; this increases the effective visual information and reduces noise, so the model learns more accurate visual-semantic relations. Experiments on the MS COCO dataset show that the proposed method captures the relationships between cross-scale visual features more precisely and generates more accurate captions. Moreover, CFFSA is a general method: combining it with other self-attention-based image captioning models further improves their performance.

Key words: Image captioning, Self-attention, Cross-scale feature fusion

CLC number: TP181
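The core idea the abstract describes — letting self-attention queries from one scale attend over a fusion of low- and high-scale visual features, so that keys and values cover a wider visual range — can be sketched in a few lines of NumPy. This is an illustrative single-head sketch under assumed feature shapes; the function name, random weight initialization, and returned attention map are inventions of this example, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_scale_self_attention(low, high, d_k=64, seed=0):
    """Single-head attention sketch: queries come from low-scale features,
    while keys/values attend over the concatenation (fusion) of low- and
    high-scale features, widening the visual range of attention.

    low  : (N_low, d)  low-scale visual features (e.g. grid features)
    high : (N_high, d) high-scale visual features (e.g. region features)
    Returns (out, attn) with out of shape (N_low, d_k).
    """
    rng = np.random.default_rng(seed)
    d = low.shape[-1]
    # Random projection weights stand in for learned parameters.
    Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d_k)) / np.sqrt(d)

    fused = np.concatenate([low, high], axis=0)  # cross-scale fusion: (N_low + N_high, d)
    Q = low @ Wq       # (N_low, d_k)
    K = fused @ Wk     # (N_low + N_high, d_k)
    V = fused @ Wv     # (N_low + N_high, d_k)

    # Scaled dot-product attention over the fused feature set.
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (N_low, N_low + N_high)
    return attn @ V, attn
```

In this sketch each low-scale query can place attention mass on high-scale features as well, which is the "widened range of attention" the abstract refers to; a real model would use learned projections and multiple heads inside a Transformer encoder layer.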