Computer Science ›› 2025, Vol. 52 ›› Issue (8): 204-213. doi: 10.11896/jsjkx.240600057

• Computer Graphics & Multimedia •


Hash Image Retrieval Based on Mixed Attention and Polarization Asymmetric Loss

LIU Huayong, XU Minghui   

  1. Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning,Wuhan 430079,China
    School of Computer Science,Central China Normal University,Wuhan 430079,China
  • Received:2024-06-06 Revised:2024-09-15 Online:2025-08-15 Published:2025-08-08
  • Corresponding author:XU Minghui(1138589701@qq.com)
  • About author:LIU Huayong(lhywuhee@ccnu.edu.cn),born in 1978,Ph.D,is a member of CCF(No.35656M).His main research interests include cross-modal retrieval,computer vision and deep learning.
    XU Minghui,born in 1998,postgraduate.Her main research interest is fast image retrieval based on deep learning.
  • Supported by:
Humanities and Social Sciences Research Project of the Ministry of Education of China(21YJA870005).



Abstract: With the continuous development of the Internet, massive and complex image data are created every day, and today's mainstream social media are flooded with image data. Processing such data effectively not only raises its utilization but also improves the user experience, so retrieving images quickly and accurately has become a meaningful and pressing problem. Convolutional neural network (CNN) models are the current mainstream hash image retrieval models. However, the convolution operation of a CNN captures only local features and cannot model global information, and its fixed receptive field cannot adapt to input images of different scales. This paper therefore builds on the Swin-Transformer, a model in the Transformer family, to achieve effective image retrieval: the Transformer's self-attention mechanism and positional encoding effectively address these shortcomings of CNNs. Yet the window attention module of the existing Swin-Transformer hash image retrieval model assigns the same weight to every channel when extracting image features, ignoring the differences and dependencies among the feature information of different channels, which lowers the usability of the extracted features and wastes computing resources. To solve these problems, this paper proposes a hash image retrieval model based on mixed attention and polarization asymmetric loss (HRMPA). Starting from a Swin-Transformer-based hash feature extraction module (HFST), a channel attention block (CAB) is added to the (S)W-MSA module of HFST, yielding a hash feature extraction module based on mixed attention (HFMA). The model can thus assign different weights to the features of different channels of the input image, which increases the diversity of the extracted features and makes maximal use of computing resources. Meanwhile, to minimize the intra-class Hamming distance, maximize the inter-class Hamming distance, fully exploit the supervision information of the data, and improve retrieval accuracy, a polarization asymmetric loss (PA) is proposed that combines the polarization loss and the asymmetric loss under a certain weight allocation ratio, effectively improving retrieval precision. Experiments show that, with a hash code length of 16 bits, the proposed model achieves a top mean average precision (mAP) of 98.73% on the single-label CIFAR-10 dataset, 1.51% higher than the VTS16-CSQ model, and a top mAP of 90.65% on the multi-label NUS-WIDE dataset, 18.02% higher than TransHash and 5.92% higher than VTS16-CSQ.
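To make the mixed-attention design concrete, the following minimal PyTorch sketch pairs a squeeze-and-excitation style channel attention block with a stand-in for Swin's windowed multi-head self-attention. It is an illustrative reading of the HFMA module, not the authors' released code: the class names, the reduction ratio, the use of nn.MultiheadAttention in place of true shifted-window attention, and the additive fusion of the two branches are all assumptions.

import torch
import torch.nn as nn

class ChannelAttentionBlock(nn.Module):
    """Squeeze-and-excitation style channel attention (CAB): re-weights
    channels with a global-pooling + two-layer gate, so that different
    channels receive different importance weights."""
    def __init__(self, dim: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) tokens; squeeze over the N spatial tokens
        weights = self.gate(x.mean(dim=1))   # (B, C) per-channel gates
        return x * weights.unsqueeze(1)      # broadcast over tokens

class MixedAttention(nn.Module):
    """Window self-attention fused with a channel-attended branch;
    nn.MultiheadAttention stands in for Swin's (S)W-MSA here."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.window_msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cab = ChannelAttentionBlock(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.window_msa(x, x, x)  # spatial (window) attention
        return attn_out + self.cab(x)           # add the channel-attended branch

# Toy usage: a batch of 4 windows of 7x7 = 49 tokens with 96 channels
tokens = torch.randn(4, 49, 96)
out = MixedAttention(dim=96, num_heads=4)(tokens)
assert out.shape == tokens.shape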

Key words: Hash retrieval, Spatial attention, Swin-Transformer, Mixed attention, Polarization loss, Asymmetric loss
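The PA loss is described above only as a weighted combination of a polarization term and an asymmetric term. The sketch below is one plausible formulation, assuming a DPN-style polarization loss [32] that pushes every relaxed hash bit away from zero toward its target sign, and an ADSH-style asymmetric pairwise term [3]; the function names and the weighting parameter lam are hypothetical, and the paper's actual allocation ratio may differ.

import torch
import torch.nn.functional as F

def polarization_loss(h: torch.Tensor, t: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Hinge-style polarization term: each relaxed bit h should exceed
    the margin in the direction of its target sign t in {-1, +1}."""
    return F.relu(margin - h * t).mean()

def asymmetric_loss(h_query: torch.Tensor, b_db: torch.Tensor, sim: torch.Tensor) -> torch.Tensor:
    """Asymmetric pairwise term: the inner product between a query's
    relaxed code and a database binary code should reach K * sim,
    where K is the code length and sim is in {-1, +1}."""
    K = h_query.size(1)
    inner = h_query @ b_db.t()             # (n_query, n_db) inner products
    return ((inner - K * sim) ** 2).mean()

def pa_loss(h_query, t_query, b_db, sim, lam=0.5, margin=1.0):
    """Combine the two terms with a weight allocation ratio lam."""
    return lam * polarization_loss(h_query, t_query, margin) + \
           (1.0 - lam) * asymmetric_loss(h_query, b_db, sim)

# Toy usage: 8 queries, 32 database items, 16-bit codes
h = torch.tanh(torch.randn(8, 16))   # relaxed query codes in (-1, 1)
t = torch.sign(torch.randn(8, 16))   # target bits in {-1, +1}
b = torch.sign(torch.randn(32, 16))  # database binary codes
s = torch.sign(torch.randn(8, 32))   # pairwise similarity labels
print(pa_loss(h, t, b, s).item())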

CLC Number: TP391.41

References
[1]ZHANG X Y,ZOU J H,HE K,et al.Accelerating very deep convolutional networks for classification and detection[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2016,38(10):1943-1955.
[2]LIU F X,ZHAO W B,WANG Z W,et al.IM3A:boosting deep neural network efficiency via in-memory addressing-assisted acceleration[C]//Proceedings of the 2021 on Great Lakes Symposium on VLSI.New York:ACM,2021:253-258.
[3]JIANG Q Y,LI W J.Asymmetric deep supervised hashing[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence.Menlo Park:AAAI,2018:3342-3349.
[4]SU S P,ZHANG C,HAN K,et al.Greedy hash:towards fast optimization for accurate hash coding in CNN[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems.Red Hook,NY:Curran Associates Inc.,2018:806-815.
[5]CAO Y,LONG M S,LIU B,et al.Deep cauchy hashing for hamming space retrieval[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.2018:1229-1237.
[6]DOSOVITSKIY A,BEYER L,KOLESNIKOV A,et al.An image is worth 16x16 words:transformers for image recognition at scale[J].arXiv:2010.11929,2020.
[7]CHEN Y B,ZHANG S,LIU F X,et al.TransHash:transformer-based hamming hashing for efficient image retrieval[C]//Proceedings of the 2022 International Conference on Multimedia Retrieval.New York:ACM,2022:127-136.
[8]LIU Z,LIN Y T,CAO Y,et al.Swin Transformer:hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE International Conference on Computer Vision.Piscataway,NJ:IEEE Computer Society,2021:10012-10022.
[9]MIAO Z,ZHAO X X,LI Y,et al.Deep supervised hash image retrieval method based on Swin Transformer[J].Journal of Hunan University(Natural Science Edition),2023,50(8):62-71.
[10]KRIZHEVSKY A,SUTSKEVER I,HINTON G E.ImageNet classification with deep convolutional neural networks[J].Communications of the ACM,2017,60(6):84-90.
[11]SIMONYAN K,ZISSERMAN A.Very deep convolutional networks for large-scale image recognition[J].arXiv:1409.1556,2014.
[12]HE K,ZHANG X,REN S,et al.Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:770-778.
[13]WANG W,YANG Y,WANG X,et al.Development of convolutional neural network and its application in image classification:A survey[J].Optical Engineering,2019,58(4):1.
[14]GKOUNTAKOS K,SEMERTZIDIS T,PAPADOPOULOS G T,et al.A reliability object layer for deep hashing-based visual indexing[C]//International Conference on MultiMedia Modeling.Cham:Springer,2019:132-143.
[15]LIONG V E,LU J W,WANG G,et al.Deep hashing for compact binary codes learning[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition(CVPR).IEEE,2015:2475-2483.
[16]ZHU H,LONG M S,WANG J M,et al.Deep hashing network for efficient similarity retrieval[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2016:2415-2421.
[17]LIU H M,WANG R P,SHAN S G,et al.Deep supervised hashing for fast image retrieval[J].International Journal of Computer Vision,2019,127(9):1217-1234.
[18]CHENG S L,LAI H C,WANG L J,et al.A novel deep hashing method for fast image retrieval[J].The Visual Computer,2019,35(9):1255-1266.
[19]FENG X J,CHENG Y W.Image retrieval based on deep convolutional neural networks and hash[J].Computer Engineering and Design,2020,41(3):670-675.
[20]SHI L Q,WANG Y M.RAN and deep hashing for image retrieval[J].Electronic Design Engineering,2021,29(6):99-103,110.
[21]ZHANG C Y,ZHU L,ZHANG S C,et al.TDHPPIR:an efficient deep hashing based privacy-preserving image retrieval method[J].Neurocomputing,2020,406:386-398.
[22]ZHANG W Q,WU D Y,ZHOU Y,et al.Binary neural network hashing for image retrieval[C]//Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.ACM,2021:1318-1327.
[23]YANG W J,WANG L J,CHENG S L,et al.Deep hash with improved dual attention for image retrieval[J].Information,2021,12(7):285.
[24]WANG X Y.Research on image retrieval method based on deep hash[D].Taiyuan:North University of China,2023.
[25]WANG W H,XIE E Z,LI X,et al.Pyramid vision transformer:a versatile backbone for dense prediction without convolutions[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.Piscataway,NJ:IEEE,2021:568-578.
[26]LI T,ZHANG Z,PEI L S,et al.HashFormer:vision transformer-based deep hashing for image retrieval[J].IEEE Signal Processing Letters,2022,29:827-831.
[27]DUBEY S R,SINGH S K,CHU W T.Vision transformer hashing for image retrieval[C]//2022 IEEE International Conference on Multimedia and Expo(ICME).IEEE,2022:1-6.
[28]HE C,WEI H X.Image retrieval based on transformer and asymmetric learning strategy[J].Journal of Image and Graphics,2023,28(2):535-544.
[29]LI K C,WANG Y L,ZHANG J H,et al.Uniformer:unifying convolution and self-attention for visual recognition[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2023,45(10):12581-12600.
[30]ZHANG Y L,LI K P,LI K,et al.Image super-resolution using very deep residual channel attention networks[C]//Proceedings of the European Conference on Computer Vision.2018:286-301.
[31]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[J].Advances in Neural Information Processing Systems,2017,2:6000-6010.
[32]FAN L X,NG K W,JU C,et al.Deep polarized network for supervised learning of accurate binary hashing codes[C]//Proceedings of the 2020 International Joint Conference on Artificial Intelligence(IJCAI).2020:825-831.
[33]KRIZHEVSKY A,HINTON G.Learning multiple layers of features from tiny images[D].Toronto:University of Toronto,2009.
[34]CHUA T S,TANG J,HONG R,et al.NUS-WIDE:a real-world web image database from National University of Singapore[C]//Proceedings of the ACM International Conference on Image and Video Retrieval.New York:ACM,2009:1-9.
[35]CAO Z,LONG M,WANG J,et al.HashNet:deep learning to hash by continuation[C]//Proceedings of the IEEE International Conference on Computer Vision.IEEE,2017:5609-5618.
[36]XIE Y Z,WEI R K,SONG J K,et al.Label-affinity self-adaptive central similarity hashing for image retrieval[J].IEEE Transactions on Multimedia,2023,25:9161-9174.