Computer Science ›› 2022, Vol. 49 ›› Issue (10): 191-197. doi: 10.11896/jsjkx.220600009
王鸣展, 冀俊忠, 贾奥哲, 张晓丹
WANG Ming-zhan, JI Jun-zhong, JIA Ao-zhe, ZHANG Xiao-dan
Abstract: In recent years, the self-attention-based encoder-decoder framework has become the mainstream model for image captioning. However, self-attention in the encoder only models visual relations among low-scale features and ignores useful information carried by high-scale visual features, which degrades the quality of the generated captions. To address this problem, this paper proposes an image captioning method based on cross-scale feature-fusion self-attention. When computing self-attention, the method fuses low-scale and high-scale visual features across scales, which widens the range that self-attention can attend to from a visual perspective, adds effective visual information while reducing noise, and thereby learns more accurate visual-semantic relations. Experimental results on the MS COCO dataset show that the proposed method captures relations among cross-scale visual features more precisely and generates more accurate captions. Notably, the method is generic: combined with other self-attention-based image captioning methods, it can further improve model performance.
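The cross-scale fusion described above can be sketched roughly as follows: queries come from the low-scale features, while keys and values are drawn from the fused set of low- and high-scale features, so each attention step ranges over both scales. This is a minimal single-head NumPy illustration, not the authors' implementation; the function name, random projections, and the simple concatenation-based fusion are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_scale_attention(low, high, d_k=64, seed=0):
    """Hypothetical sketch of cross-scale feature-fusion self-attention.

    low  : (N_low, d)  low-scale visual features (e.g. fine grid features)
    high : (N_high, d) high-scale visual features (e.g. coarser features)

    Queries are taken from the low-scale features; keys/values are taken
    from the concatenation of both scales, so attention weights can be
    placed on either scale.
    """
    rng = np.random.default_rng(seed)
    d = low.shape[-1]
    # Random projection matrices stand in for learned weights.
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))

    fused = np.concatenate([low, high], axis=0)   # (N_low + N_high, d)
    Q = low @ Wq                                  # (N_low, d_k)
    K = fused @ Wk                                # (N_low + N_high, d_k)
    V = fused @ Wv                                # (N_low + N_high, d_k)

    # Scaled dot-product attention over the fused cross-scale set.
    attn = softmax(Q @ K.T / np.sqrt(d_k))        # (N_low, N_low + N_high)
    return attn @ V                               # (N_low, d_k)
```

In a full model this block would replace the encoder's standard self-attention, with learned projections and multiple heads; the key design point is only that the key/value set spans both feature scales.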