Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240400151-10. doi: 10.11896/jsjkx.240400151

• Image Processing & Multimedia Technology •


FLIP-based Joint Similarity Preserving Hashing for Cross-modal Retrieval

TANG Lijun (唐立军), YANG Zheng (杨政), ZHAO Nan (赵男), ZHAI Suwei (翟苏巍)

  1. Electric Power Research Institute, Yunnan Power Grid Co., Ltd., Kunming 650217, China
  • Online: 2025-06-16  Published: 2025-06-12
  • Corresponding author: ZHAI Suwei (suwei_zhai@163.com)
  • About author: TANG Lijun (tlijun@foxmail.com), born in 1985, postgraduate, senior engineer. His main research interests include power grid automation and the application of artificial intelligence technology.
    ZHAI Suwei, born in 1991, post-doctoral researcher. His main research interests include power system and automation, as well as artificial intelligence technology.


Abstract: Recently, supervised cross-modal retrieval techniques have garnered significant attention. However, existing methods rely primarily on sample-level semantic relationships to assess the similarity between samples, while neglecting the potential impact of label distribution on retrieval performance. Furthermore, existing approaches still face challenges such as inaccurate feature extraction and sluggish processing rates. To address these problems, we introduce a new method, termed FLIP-based joint similarity preserving hashing (FJSPH), for cross-modal retrieval. Specifically, we leverage the fast language-image pre-training model (FLIP) to extract more accurate cross-modal features. To further reduce cross-modal semantic differences, we enhance modal interaction and refine fine-grained modality semantic representations through multimodal contrastive learning. In addition, we use sample-wise similarity and cluster-wise similarity to further exploit the semantic correlation between different modalities. This ensures that samples sharing similar semantics are positioned closer together in Hamming space, thereby producing more discriminative hash codes. Experimental results on three cross-modal datasets indicate that FJSPH achieves excellent performance in cross-modal hashing retrieval.
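The joint similarity-preserving objective described in the abstract can be made concrete with a small sketch. The following is a minimal PyTorch illustration, under our own assumptions, of how a sample-wise pairwise term, a cluster-wise prototype term, and a binary quantization term might be combined; the function name joint_similarity_loss, the learnable prototype matrix, the temperature, and the weights alpha and beta are hypothetical choices for illustration only, not the authors' released FJSPH implementation.

```python
# Illustrative sketch (not the authors' implementation): one way to combine
# sample-wise, cluster-wise, and quantization terms into a joint loss.
import torch
import torch.nn.functional as F

def joint_similarity_loss(img_codes, txt_codes, labels, prototypes,
                          alpha=1.0, beta=0.1, temperature=0.1):
    """
    img_codes, txt_codes: (N, K) real-valued relaxed hash codes (e.g. tanh outputs).
    labels:               (N, C) multi-hot label matrix (float).
    prototypes:           (C, K) learnable class/cluster prototypes in code space.
    alpha, beta, temperature are illustrative hyper-parameters.
    """
    # Sample-wise semantic affinity: 1 if two samples share at least one label.
    S = (labels @ labels.t() > 0).float()                     # (N, N)

    # Cosine similarity between every image code and every text code.
    img_n = F.normalize(img_codes, dim=1)
    txt_n = F.normalize(txt_codes, dim=1)
    theta = 0.5 * (img_n @ txt_n.t())                         # (N, N)

    # Sample-wise term: negative log-likelihood of the pairwise affinities,
    # pulling semantically similar cross-modal pairs together in code space.
    sample_loss = -(S * theta - torch.log1p(torch.exp(theta))).mean()

    # Cluster-wise term: both modalities should score high against the
    # prototypes of their own labels (cross-entropy with soft label targets,
    # which requires PyTorch >= 1.10).
    proto_n = F.normalize(prototypes, dim=1)
    target = labels / labels.sum(dim=1, keepdim=True).clamp_min(1e-8)
    cluster_loss = (F.cross_entropy(img_n @ proto_n.t() / temperature, target)
                    + F.cross_entropy(txt_n @ proto_n.t() / temperature, target))

    # Quantization term: push relaxed codes toward binary values in {-1, +1}.
    quant_loss = ((img_codes.abs() - 1) ** 2).mean() + ((txt_codes.abs() - 1) ** 2).mean()

    return sample_loss + alpha * cluster_loss + beta * quant_loss

# Toy usage with random tensors, just to show the expected shapes.
if __name__ == "__main__":
    N, K, C = 8, 64, 24
    img = torch.tanh(torch.randn(N, K))
    txt = torch.tanh(torch.randn(N, K))
    lab = (torch.rand(N, C) > 0.8).float()
    lab[lab.sum(dim=1) == 0, 0] = 1.0      # ensure every sample has a label
    protos = torch.randn(C, K, requires_grad=True)
    print(joint_similarity_loss(img, txt, lab, protos))
```

Minimizing such a loss pushes cross-modal pairs with shared labels toward similar codes (sample level), anchors both modalities to shared label prototypes (cluster level), and drives the relaxed codes toward binary values before sign thresholding.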

Key words: Joint similarity preserving, Fast language-image pre-trained model, Cross-modal retrieval, Sample-wise similarity, Cluster-wise similarity

CLC Number: TP391

References:
[1]DING G G,GUO Y C,ZHOU J.Collective matrix factorization hashing for multimodal data[C]//Proceedings IEEE Conf.Comput.Vis.Pattern Recognit..2014:2075-2082.
[2]SU S P,ZHONG Z S,ZHANG C.Deep joint-semantics reconstructing hashing for large-scale unsupervised cross-modal retrieval[C]//Proceedings IEEE/CVF Int.Conf.Comput.Vis..2019:3027-3035.
[3]YANG D J,WU D Y,ZHANG W Q,et al.Deep semantic-alignment hashing for unsupervised cross-modal retrieval[C]//Proceedings 2020 Int.Conf.Multimed.Retr..2020:44-52.
[4]LIU S,QIAN S S,GUAN Y,et al.Joint-modal distribution-based similarity hashing for large-scale unsupervised deep cross-modal retrieval[C]//Proceedings 43rd Int.ACM SIGIR Conf.Res.Dev.Inf.Retr..2020:1379-1388.
[5]LIN Z J,DING G G,HU M Q,et al.Semantics-preserving hashing for cross-view retrieval[C]//Proceedings IEEE Conf.Comput.Vis.Pattern Recognit..2015:3864-3872.
[6]LI T Y,YANG X C,WANG B,et al.bi-CMR:Bidirectional reinforcement guided hashing for effective cross-modal retrieval[C]//Proceedings AAAI Conf.Artif.Intell..2022:10275-10282.
[7]RADFORD A,KIM J W,HALLACY C,et al.Learning transferable visual models from natural language supervision[C]//Proceedings Int.Conf.Mach.Learn..2021:8748-8763.
[8]SAUER A,KARRAS T,LAINE S,et al.Stylegan-t:Unlocking the power of gans for fast large-scale text-to-image synthesis[J].arXiv:2301.09515,2023.
[9]CHEN R N,et al.Clip2scene:Towards label-efficient 3d scene understanding by clip[C]//Proceedings IEEE/CVF Conf.Comput.Vis.Pattern Recognit..2023:7020-7030.
[10]YU W W,LIU Y L,HUA W,et al.Turning a clip model into a scene text detector[C]//Proceedings IEEE/CVF Conf.Comput.Vis.Pattern Recognit..2023:6978-6988.
[11]HE K M,CHEN X L,XIE S N,et al.Masked autoencoders are scalable vision learners[C]//Proceedings IEEE/CVF Conf.Comput.Vis.Pattern Recognit..2022:16000-16009.
[12]LI Y H,FAN H Q,HU R H,et al.Scaling language-image pre-training via masking[C]//Proceedings IEEE/CVF Conf.Comput.Vis.Pattern Recognit..2023:23390-23400.
[13]MANDAL D,CHAUDHURY K N,BISWAS S.Generalized semantic preserving hashing for n-label cross-modal retrieval[C]//Proceedings IEEE Conf.Comput.Vis.Pattern Recognit..2017:4076-4084.
[14]WANG Y X,CHEN Z D,LUO X,et al.High-dimensional sparse cross-modal hashing with fine-grained similarity embedding[C]//Proceedings Web Conf..2021:2900-2909.
[15]KUMAR S,UDUPA R.Learning hash functions for cross-view similarity search[C]//Proceedings 22nd Int.Joint Conf.Artif.Intell..2011.
[16]WANG W W,SHEN Y M,ZHANG H F,et al.Set and rebase:determining the semantic graph connectivity for unsupervised cross-modal hashing[C]//Proceedings 29th Int.Joint Conf.Artif.Intell..2021:853-859.
[17]LI X L,HU D,NIE F P.Deep binary reconstruction for cross-modal hashing[C]//Proceedings 25th ACM Int.Conf.Multimedia.2017:1398-1406.
[18]YU J,ZHOU H,ZHAN Y B,et al.Deep graph-neighbor coherence preserving network for unsupervised cross-modal hashing[C]//Proceedings AAAI Conf.Artif.Intell..2021:4626-4634.
[19]TU R C,et al.Unsupervised cross-modal hashing via semantic text mining[J].IEEE Trans.Multimedia,2023.
[20]XIA X Y,DONG G H,LI F L,et al.When clip meets cross-modal hashing retrieval:A new strong baseline[J].Inf.Fusion,2023,100:101968.
[21]JIN L,LI K,HU H,et al.Semantic neighbor graph hashing for multimodal retrieval[J].IEEE Trans.Image Process.,2017,27(3):1405-1417.
[22]TANG J,WANG K,SHAO L.Supervised matrix factorization hashing for cross-modal retrieval[J].IEEE Trans.Image Process.,2016,25(7):3157-3166.
[23]LIU X,HU Z K,LING H B,et al.Mtfh:A matrix tri-factorization hashing framework for efficient cross-modal retrieval[J].IEEE Trans.Pattern Anal.Mach.Intell.,2019,43(3):964-981.
[24]WANG Y X,LUO X,NIE L Q,et al.Batch:A scalable asymmetric discrete cross-modal hashing[J].IEEE Trans.Knowl.Data Eng.,2020,33(11):3507-3519.
[25]JIANG Q Y,LI W J.Deep cross-modal hashing[C]//Proceedings IEEE Conf.Comput.Vis.Pattern Recognit..2017:3232-3240.
[26]XIE D,DENG C,LI C,et al.Multi-task consistency-preserving adversarial hashing for cross-modal retrieval[J].IEEE Trans.Image Process.,2020,29:3626-3637.
[27]XU R Q,LI C,YAN J C,et al.Graph convolutional network hashing for cross-modal retrieval[C]//Proceedings Int.Joint Conf.Artif.Intell..2019:982-988.
[28]TU R C,MAO X L,MA B,et al.Deep cross-modal hashing with hashing functions and unified hash codes jointly learning[J].IEEE Trans.Knowl.Data Eng.,2020,34(2):560-572.
[29]BAI C,ZENG C,MA Q,et al.Deep adversarial discrete hashing for cross-modal retrieval[C]//Proceedings 2020 Int.Conf.Multimed.Retr..2020:525-531.
[30]ZENG Z X,MAO W J.A comprehensive empirical study of vision-language pre-trained model for supervised cross-modal retrieval[J].arXiv:2201.02772,2022.
[31]HUISKES M J,LEW M S.The mir flickr retrieval evaluation[C]//Proceedings 1st ACM Int.Conf.Multimed.Inf.Retr..2008:39-43.
[32]CHUA T S,TANG J H,HONG R C,et al.Nus-wide:a real-world web image database from national university of singapore[C]//Proceedings ACM Int.Conf.Image Video Retr..2009:1-9.
[33]LIN T Y,MAIRE M,BELONGIE S,et al.Microsoft coco:Common objects in context[C]//Computer Vision-ECCV 2014:13th European Conference,Zurich,Switzerland,September 6-12,2014,Proceedings,Part V.Springer,2014:740-755.
[34]SIMONYAN K,ZISSERMAN A.Very deep convolutional networks for large-scale image recognition[J].arXiv:1409.1556,2014.
[35]ZHENG C Q,ZHU L,CHENG Z Y,et al.Adaptive partial multi-view hashing for efficient social image retrieval[J].IEEE Trans.Multimedia,2020,23:4079-4092.
[36]ZHANG D L,WU X J,YU J.Label consistent flexible matrix factorization hashing for efficient cross-modal retrieval[J].ACM Trans.Multimed.Comput.Commun.Appl.,2021,17(3):1-18.
[37]LUO K Y,ZHANG C,LI H X,et al.Adaptive marginalized semantic hashing for unpaired cross-modal retrieval[J].IEEE Trans.Multimedia,2023.
[38]CHEN Y,ZHANG H,TIAN Z B,et al.Enhanced discrete multi-modal hashing:More constraints yet less time to learn[J].IEEE Trans.Knowl.Data Eng.,2020,34(3):1177-1190.
[39]HU Z K,CHEUNG Y M,LI M K,et al.Joint semantic preserving sparse hashing for cross-modal retrieval[J].IEEE Trans.Circuits Syst.Video Technol.,2023.
[40]LI C,DENG C,LI N,et al.Self-supervised adversarial hashing networks for cross-modal retrieval[C]//Proceedings IEEE Conf.Comput.Vis.Pattern Recognit..2018:4242-4251.
[41]ZHANG Z,LUO H Y,ZHU L,et al.Modality-invariant asymmetric networks for cross-modal hashing[J].IEEE Trans.Knowl.Data Eng.,2022,35(5):5091-5104.
[42]YU E,MA J H,SUN J D,et al.Deep discrete cross-modal hashing with multiple supervision[J].Neurocomputing,2022,486:215-224.
[43]LI X,YU J,LU H C,et al.Mafh:Multilabel aware framework for bit-scalable cross-modal hashing[J].Knowl.Based Syst.,2023,279:110922.
[44]HINTON G E,SRIVASTAVA N,KRIZHEVSKY A,et al.Improving neural networks by preventing co-adaptation of feature detectors[J].arXiv:1207.0580,2012.
[45]KO Y.A study of term weighting schemes using class information for text classification[C]//Proceedings 35th Int.ACM SIGIR Conf.Res.Dev.Inf.Retr..2012:1029-1030.