Computer Science ›› 2023, Vol. 50 ›› Issue (6A): 220300092-6. DOI: 10.11896/jsjkx.220300092

• Image Processing & Multimedia Technology •

  • Corresponding author: LI Huawang (lihw@microsate.com)
  • About author: ZHANG Shunyao (zhangshy4@shanghaitech.edu.cn)

Image Retrieval Based on Independent Attention Mechanism

ZHANG Shunyao1,2,3, LI Huawang1,2,3, ZHANG Yonghe1,3, WANG Xinyu1,3, DING Guopeng1,3   

  1. 1 Innovation Academy for Microsatellites of Chinese Academy of Sciences,Shanghai 201210,China;
    2 ShanghaiTech University,Shanghai 201210,China;
    3 University of Chinese Academy of Sciences,Beijing 100094,China
  • Online:2023-06-10 Published:2023-06-12
  • About author:ZHANG Shunyao,born in 1996,postgraduate.His main research interests include content-based image retrieval and pose estimation.LI Huawang,born in 1973,Ph.D,professor,Ph.D supervisor.His main research interests include digital signal processing and computer science.



Abstract: In recent years,deep learning methods have taken a dominant position in the field of content-based image retrieval.To improve the features extracted by off-the-shelf backbones and enable the network to produce more discriminative image descriptors,an attention module named ICSA(independent channel-wise and spatial attention),whose weights are independent of the features fed into the module,is proposed.Its attention weights stay the same when the input features change,whereas other attention mechanisms usually compute their weights from the input features;this is the main difference between ICSA and other attention modules.This property also makes the module quite small(only 6.7kB,5.2% of the size of SENet and 2.6% of the size of CBAM) and relatively fast(similar to SENet in speed,and 14.9% of the running time of CBAM).The attention of ICSA is divided into two parts,channel-wise attention and spatial attention,which store weights along orthogonal directions of the input features.Experiments on the Pittsburgh dataset show that adding ICSA improves Recall@1 by 0.1% to 2.4% across different backbones.
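The paper itself gives no code, but the core idea stated in the abstract, attention weights stored as free learned parameters rather than computed from the incoming feature map, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class name, shapes, and the multiplicative sigmoid gating are hypothetical choices, chosen only to show why such a module is input-independent and why its parameter count is tiny (C + H·W scalars for a C×H×W feature map).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ICSASketch:
    """Sketch of input-independent channel-wise and spatial attention.

    Unlike SENet or CBAM, the attention weights here are parameters stored
    in the module itself, so they do not depend on the incoming features.
    """

    def __init__(self, channels, height, width, rng=None):
        rng = rng or np.random.default_rng(0)
        # Free parameters: one weight per channel, one weight per spatial cell.
        # In training these would be updated by backpropagation.
        self.channel_logits = rng.standard_normal(channels)          # (C,)
        self.spatial_logits = rng.standard_normal((height, width))   # (H, W)

    def __call__(self, x):
        # x: feature map of shape (C, H, W) from any backbone.
        c = sigmoid(self.channel_logits)[:, None, None]  # broadcast over H, W
        s = sigmoid(self.spatial_logits)[None, :, :]     # broadcast over C
        return x * c * s  # re-weighted features, same shape as x

# The same module applies the identical attention map to any input:
attn = ICSASketch(channels=512, height=7, width=7)
f1 = np.ones((512, 7, 7))
f2 = np.random.default_rng(1).standard_normal((512, 7, 7))
out1, out2 = attn(f1), attn(f2)
```

Because `f1` is all ones, `out1` is exactly the combined attention map, and `out2` equals `f2` element-wise multiplied by that same map, which makes the input-independence concrete. It also illustrates the abstract's size claim: this configuration stores only 512 + 49 parameters, far fewer than modules that compute attention from the input.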

Key words: Content-based image retrieval, Attention mechanism, Feature enhancement

CLC number: TP391
[1]LEW M S,SEBE N,DJERABA C,et al.Content-based multimedia information retrieval[J].ACM Transactions on Multimedia Computing,Communications,and Applications,2006,2(1):1-19.
[2]SMEULDERS A W M,WORRING M,SANTINI S,et al.Content-based image retrieval at the end of the early years[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2000,22(12):1349-1380.
[3]CHANG S K,HSU A.Image information systems:where do we go from here?[J].IEEE transactions on Knowledge and Data Engineering,1992,4(5):431-442.
[4]SIVIC J,ZISSERMAN A.Video Google:A text retrieval approach to object matching in videos[C]//IEEE International Conference on Computer Vision.IEEE Computer Society,2003:1470-1470.
[5]FEI-FEI L,PERONA P.A Bayesian hierarchical model for learning natural scene categories[C]//2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition(CVPR'05).IEEE,2005:524-531.
[6]LOWE D G.Distinctive image features from scale-invariant keypoints[J].International Journal of Computer Vision,2004,60(2):91-110.
[7]JÉGOU H,DOUZE M,SCHMID C,et al.Aggregating local descriptors into a compact image representation[C]//2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.IEEE,2010:3304-3311.
[8]PERRONNIN F,SÁNCHEZ J,MENSINK T.Improving the Fisher kernel for large-scale image classification[C]//European Conference on Computer Vision.Springer,2010:143-156.
[9]KRIZHEVSKY A,SUTSKEVER I,HINTON G E.ImageNet classification with deep convolutional neural networks[J].Advances in Neural Information Processing Systems,2012,60(6):84-90.
[10]SIMONYAN K,ZISSERMAN A.Very deep convolutional networks for large-scale image recognition[J].arXiv:1409.1556,2014.
[11]HE K,ZHANG X,REN S,et al.Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:770-778.
[12]DOSOVITSKIY A,BEYER L,KOLESNIKOV A,et al.An image is worth 16x16 words:Transformers for image recognition at scale[J].arXiv:2010.11929,2020.
[13]BABENKO A,SLESAREV A,CHIGORIN A,et al.Neural codes for image retrieval[C]//European Conference on Computer Vision.Springer,2014:584-599.
[14]LAI H,PAN Y,LIU Y,et al.Simultaneous feature learning and hash coding with deep neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2015:3270-3278.
[15]NOROUZI M,FLEET D J,SALAKHUTDINOV R R.Hamming distance metric learning[J].Advances in Neural Information Processing Systems,2012,25:1061-1069.
[16]ZHANG R,LIN L,ZHANG R,et al.Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification[J].IEEE Transactions on Image Processing,2015,24(12):4766-4779.
[17]ARANDJELOVIC R,GRONAT P,TORII A,et al.NetVLAD:CNN architecture for weakly supervised place recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:5297-5307.
[18]ONG E J,HUSAIN S,BOBER M.Siamese network of deep fisher-vector descriptors for image retrieval[J].arXiv:1702.00338,2017.
[19]RADENOVIĆ F,TOLIAS G,CHUM O.CNN image retrieval learns from BoW:Unsupervised fine-tuning with hard examples[C]//European Conference on Computer Vision.Springer,2016:3-20.
[20]BROWN A,XIE W,KALOGEITON V,et al.Smooth-AP:Smoothing the path towards large-scale image retrieval[C]//European Conference on Computer Vision.Springer,2020:677-694.
[21]BABENKO A,LEMPITSKY V.Aggregating local deep features for image retrieval[C]//Proceedings of the IEEE International Conference on Computer Vision.2015:1269-1277.
[22]KALANTIDIS Y,MELLINA C,OSINDERO S.Cross-dimensional weighting for aggregated deep convolutional features[C]//European Conference on Computer Vision.Springer,2016:685-701.
[23]ITTI L,KOCH C,NIEBUR E.A model of saliency-based visual attention for rapid scene analysis[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,1998,20(11):1254-1259.
[24]MNIH V,HEESS N,GRAVES A.Recurrent models of visual attention[J].Advances in Neural Information Processing Systems,2014,27:2204-2212.
[25]JADERBERG M,SIMONYAN K,ZISSERMAN A.Spatial transformer networks[J].Advances in Neural Information Processing Systems,2015,28.
[26]HU J,SHEN L,SUN G.Squeeze-and-excitation networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:7132-7141.
[27]WOO S,PARK J,LEE J Y,et al.CBAM:Convolutional block attention module[C]//Proceedings of the European Conference on Computer Vision(ECCV).2018:3-19.
[28]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[J].Advances in Neural Information Processing Systems,2017,2:6000-6010.
[29]WANG X,GIRSHICK R,GUPTA A,et al.Non-local neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:7794-7803.
[30]GUO M H,XU T X,LIU J J,et al.Attention Mechanisms in Computer Vision:A Survey[J].arXiv:2111.07624,2021.
[31]BALNTAS V,RIBA E,PONSA D,et al.Learning local feature descriptors with triplets and shallow convolutional neural networks[C]//BMVC,2016.
[32]SUTSKEVER I,MARTENS J,DAHL G,et al.On the importance of initialization and momentum in deep learning[C]//International Conference on Machine Learning.PMLR,2013:1139-1147.
[33]VAN DER MAATEN L,HINTON G.Visualizing data using t-SNE[J].Journal of Machine Learning Research,2008,9(11):2579-2605.