Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 221100191-6. doi: 10.11896/jsjkx.221100191
GU Baocheng, LIU Li
Abstract: Hash-based cross-modal retrieval algorithms offer low storage consumption and high search efficiency, which has made cross-modal hash retrieval over multimedia data a current research hotspot. Mainstream cross-modal hashing methods focus on learning hash codes across modalities, while neglecting feature learning and semantic fusion between the modalities. This work converts the image-text matching problem in CLIP into a pixel-text matching problem: text features query image features through a Transformer decoder, encouraging the text features to learn the most relevant pixel-level image information, and the pixel-text matching score in turn guides feature learning in the image modality, mining deeper correlated semantic information between modalities. A binary cross-entropy loss is further introduced to strengthen semantic fusion between modalities, so that high-quality binary hash codes are obtained when the high-dimensional features are mapped into the low-dimensional Hamming space. Comparative experiments on the MIRFLICKR-25K and NUS-WIDE datasets show that the proposed model outperforms current mainstream algorithms at every hash-code length tested.
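The pipeline the abstract describes (text features querying pixel-level image features through a decoder, a matching-score loss, and binarization into Hamming-space codes) can be sketched roughly as follows. This is a minimal illustration only: the single-head, projection-free attention, the random-projection hashing head, and all function names are simplifying assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pixel_text_cross_attention(text_feat, pixel_feat):
    """Text tokens (queries) attend over pixel features (keys/values).

    A stand-in for one Transformer decoder layer, with no learned
    projections. Returns attended features and raw matching scores.
    """
    d = text_feat.shape[-1]
    scores = text_feat @ pixel_feat.T / np.sqrt(d)  # (n_tokens, n_pixels)
    attn = softmax(scores, axis=-1)
    return attn @ pixel_feat, scores

def bce_loss(logit, label):
    """Binary cross-entropy on a pixel-text matching logit (0/1 label)."""
    p = 1.0 / (1.0 + np.exp(-logit))
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

def to_hash_code(features, code_len, rng):
    """Map high-dimensional features to a binary code.

    tanh gives the usual continuous relaxation; sign binarizes to
    {-1, +1} for Hamming-space retrieval. The random projection is a
    placeholder for a learned hashing layer.
    """
    W = rng.standard_normal((features.shape[-1], code_len))
    relaxed = np.tanh(features.mean(axis=0) @ W)
    return np.where(relaxed >= 0, 1, -1)

rng = np.random.default_rng(0)
text = rng.standard_normal((5, 64))     # 5 text tokens, dim 64
pixels = rng.standard_normal((49, 64))  # 7x7 grid of pixel features
fused, scores = pixel_text_cross_attention(text, pixels)
loss = bce_loss(scores.mean(), label=1)  # treat as a matching image-text pair
code = to_hash_code(fused, code_len=32, rng=rng)
print(code.shape, loss > 0)
```

In the paper's setting the matching scores and the BCE term would be trained jointly so that the binarization step loses as little cross-modal semantic information as possible; the sketch above only shows the data flow.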