Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 221100191-6. doi: 10.11896/jsjkx.221100191

• Image Processing & Multimedia Technology •

Cross-modal Hash Retrieval Based on Text-guided Image Semantic Fusion

GU Baocheng, LIU Li   

  1. School of Computing, University of South China, Hengyang, Hunan 421001, China
  • Published: 2023-11-09
  • Corresponding author: LIU Li (liuleelap@163.com)
  • About author: GU Baocheng, born in 1994, postgraduate (1254427045@qq.com). His main research interests include computer vision and cross-modal retrieval.
    LIU Li, born in 1971, Ph.D, professor. His main research interests include digital image processing and embedded systems.



Abstract: Hash-based cross-modal retrieval algorithms feature low storage consumption and high search efficiency, and the application of cross-modal hash retrieval to multimedia data has become a current research hotspot. Mainstream methods for cross-modal hash retrieval focus on learning inter-modal hash codes while neglecting feature learning and semantic fusion across modalities. This paper transforms the image-text matching problem in CLIP into a pixel-text matching problem: text features query image features through a Transformer decoder, encouraging the text features to learn the most relevant pixel-level image information, and the pixel-text matching score in turn guides feature learning in the image modality, mining deeper correlated semantic information between the modalities. A binary cross-entropy loss function is introduced to improve semantic fusion between modalities, so that high-quality binary hash codes are obtained when high-dimensional features are mapped into the low-dimensional Hamming space. Comparative experiments on the MIRFLICKR-25K and NUS-WIDE datasets show that the proposed model outperforms current mainstream algorithms at all tested hash code lengths.
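To make the pipeline in the abstract concrete, the following is a minimal PyTorch sketch of text-guided semantic fusion with hash learning. It is an illustration under assumptions, not the authors' implementation: the module name TextGuidedFusionHash, the helper bce_fusion_loss, all dimensions, the mean-pooled text query, and the score-weighted pooling are hypothetical choices, and the pairwise similarity labels (1 when two samples share at least one tag, else 0) follow the usual multi-label convention for MIRFLICKR-25K and NUS-WIDE.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedFusionHash(nn.Module):
    # Text tokens query pixel-level image features through a Transformer
    # decoder; the resulting pixel-text matching score re-weights the image
    # features before both modalities are mapped to k-bit hash codes.
    def __init__(self, dim=512, n_heads=8, n_layers=2, hash_bits=64):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.img_hash = nn.Linear(dim, hash_bits)  # image hash head
        self.txt_hash = nn.Linear(dim, hash_bits)  # text hash head

    def forward(self, pixel_feats, text_feats):
        # pixel_feats: (B, HW, dim) patch/pixel features from a CLIP-style
        # image encoder; text_feats: (B, T, dim) token features from the
        # matching text encoder.
        # 1) Pixel-text matching: text tokens (queries) attend over pixel
        #    features (memory), absorbing the most relevant pixel-level cues.
        text_q = self.decoder(tgt=text_feats, memory=pixel_feats)
        t = F.normalize(text_q.mean(dim=1), dim=-1)        # (B, dim)
        p = F.normalize(pixel_feats, dim=-1)               # (B, HW, dim)
        # 2) Matching score between every pixel and the pooled text query.
        score = torch.einsum('bnd,bd->bn', p, t)           # (B, HW)
        # 3) The score guides image feature learning: pixels are re-weighted
        #    by their text relevance before global pooling.
        weights = score.softmax(dim=-1).unsqueeze(-1)      # (B, HW, 1)
        img_global = (pixel_feats * weights).sum(dim=1)    # (B, dim)
        # 4) tanh keeps training differentiable; sign() binarizes at test time.
        img_code = torch.tanh(self.img_hash(img_global))
        txt_code = torch.tanh(self.txt_hash(text_q.mean(dim=1)))
        return img_code, txt_code, score

def bce_fusion_loss(img_code, txt_code, sim_labels):
    # Binary cross-entropy over pairwise code similarity: sim_labels[i, j]
    # is 1 if samples i and j share at least one label, else 0 (assumed).
    # Dividing by the code length keeps the logits in (-1, 1).
    logits = img_code @ txt_code.t() / img_code.shape[1]
    return F.binary_cross_entropy_with_logits(logits, sim_labels)

A toy forward pass under the same assumptions:

model = TextGuidedFusionHash(hash_bits=64)
pix = torch.randn(4, 49, 512)            # e.g. a 7x7 CLIP ViT feature map
txt = torch.randn(4, 16, 512)            # 16 text tokens per caption
img_code, txt_code, _ = model(pix, txt)
sim = (torch.rand(4, 4) > 0.5).float()   # toy pairwise label matrix
loss = bce_fusion_loss(img_code, txt_code, sim)

The tanh relaxation is a common stand-in for the non-differentiable sign(); at retrieval time the codes would be binarized with torch.sign and compared by Hamming distance.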

Key words: Hash, CLIP, Transformer, Binary cross-entropy, Cross-modal retrieval

CLC Number: 

  • TP391