Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 221100191-6. doi: 10.11896/jsjkx.221100191

• Image Processing & Multimedia Technology •

Cross-modal Hash Retrieval Based on Text-guided Image Semantic Fusion

GU Baocheng, LIU Li   

  1. School of Computing, University of South China, Hengyang, Hunan 421001, China
  • Published: 2023-11-09
  • Corresponding author: LIU Li (liuleelap@163.com)
  • About author: GU Baocheng, born in 1994, postgraduate (1254427045@qq.com). His main research interests include computer vision and cross-modal retrieval.
    LIU Li, born in 1971, Ph.D, professor. His main research interests include digital image processing and embedded systems.



Abstract: Hash-based cross-modal retrieval algorithms feature low storage consumption and high search efficiency, and the application of cross-modal hash retrieval to multimedia data has become a current research hotspot. Mainstream methods for cross-modal hash retrieval focus on learning inter-modal hash codes while neglecting feature learning and semantic fusion across modalities. This paper transforms the image-text matching problem in CLIP into a pixel-text matching problem: text features query image features through a Transformer decoder, encouraging the text features to learn the most relevant pixel-level image information, and the pixel-text matching score in turn guides feature learning in the image modality, mining deeper correlated semantic information between the modalities. A binary cross-entropy loss function is introduced to improve semantic fusion between modalities, so that high-quality binary hash codes are obtained when high-dimensional features are mapped into the low-dimensional Hamming space. Comparative experiments on the MIRFLICKR-25K and NUS-WIDE datasets show that the proposed model outperforms current mainstream algorithms at all tested hash code lengths.
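To make the pipeline in the abstract concrete, the following is a minimal PyTorch sketch of text-guided semantic fusion with hash learning. It is an illustration under assumptions, not the authors' implementation: the module name TextGuidedFusionHash, the helper bce_fusion_loss, all dimensions, the mean-pooled text query, and the score-weighted pooling are hypothetical choices, and the pairwise similarity labels (1 when two samples share at least one tag, else 0) follow the usual multi-label convention for MIRFLICKR-25K and NUS-WIDE.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedFusionHash(nn.Module):
    # Text tokens query pixel-level image features through a Transformer
    # decoder; the resulting pixel-text matching score re-weights the image
    # features before both modalities are mapped to k-bit hash codes.
    def __init__(self, dim=512, n_heads=8, n_layers=2, hash_bits=64):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.img_hash = nn.Linear(dim, hash_bits)  # image hash head
        self.txt_hash = nn.Linear(dim, hash_bits)  # text hash head

    def forward(self, pixel_feats, text_feats):
        # pixel_feats: (B, HW, dim) patch/pixel features from a CLIP-style
        # image encoder; text_feats: (B, T, dim) token features from the
        # matching text encoder.
        # 1) Pixel-text matching: text tokens (queries) attend over pixel
        #    features (memory), absorbing the most relevant pixel-level cues.
        text_q = self.decoder(tgt=text_feats, memory=pixel_feats)
        t = F.normalize(text_q.mean(dim=1), dim=-1)        # (B, dim)
        p = F.normalize(pixel_feats, dim=-1)               # (B, HW, dim)
        # 2) Matching score between every pixel and the pooled text query.
        score = torch.einsum('bnd,bd->bn', p, t)           # (B, HW)
        # 3) The score guides image feature learning: pixels are re-weighted
        #    by their text relevance before global pooling.
        weights = score.softmax(dim=-1).unsqueeze(-1)      # (B, HW, 1)
        img_global = (pixel_feats * weights).sum(dim=1)    # (B, dim)
        # 4) tanh keeps training differentiable; sign() binarizes at test time.
        img_code = torch.tanh(self.img_hash(img_global))
        txt_code = torch.tanh(self.txt_hash(text_q.mean(dim=1)))
        return img_code, txt_code, score

def bce_fusion_loss(img_code, txt_code, sim_labels):
    # Binary cross-entropy over pairwise code similarity: sim_labels[i, j]
    # is 1 if samples i and j share at least one label, else 0 (assumed).
    # Dividing by the code length keeps the logits in (-1, 1).
    logits = img_code @ txt_code.t() / img_code.shape[1]
    return F.binary_cross_entropy_with_logits(logits, sim_labels)

A toy forward pass under the same assumptions:

model = TextGuidedFusionHash(hash_bits=64)
pix = torch.randn(4, 49, 512)            # e.g. a 7x7 CLIP ViT feature map
txt = torch.randn(4, 16, 512)            # 16 text tokens per caption
img_code, txt_code, _ = model(pix, txt)
sim = (torch.rand(4, 4) > 0.5).float()   # toy pairwise label matrix
loss = bce_fusion_loss(img_code, txt_code, sim)

The tanh relaxation is a common stand-in for the non-differentiable sign(); at retrieval time the codes would be binarized with torch.sign and compared by Hamming distance.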

Key words: Hash, CLIP, Transformer, Binary cross-entropy, Cross-modal retrieval

CLC Number: 

  • TP391