Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 221100191-6.doi: 10.11896/jsjkx.221100191

• Image Processing & Multimedia Technology •

Cross-modal Hash Retrieval Based on Text-guided Image Semantic Fusion

GU Baocheng, LIU Li   

  1. School of Computing, University of South China, Hengyang, Hunan 421001, China
  • Published: 2023-11-09
  • About author: GU Baocheng, born in 1994, postgraduate. His main research interests include computer vision and cross-modal retrieval.
    LIU Li, born in 1971, Ph.D, professor. His main research interests include digital image processing and embedded systems.

Abstract: Hash-based cross-modal retrieval algorithms are characterized by low storage consumption and high search efficiency, and the application of cross-modal hash retrieval to multimedia data has become a current research hotspot. Mainstream cross-modal hashing methods focus on learning inter-modal hash codes while neglecting feature learning and semantic fusion between different modalities. This paper transforms the image-text matching problem in CLIP into a pixel-text matching problem: text features query image features through a Transformer decoder, encouraging the text features to learn the most relevant pixel-level image information; the pixel-text matching score then guides image-modality feature learning, mining deeper shared semantic information between modalities; and a binary cross-entropy loss function is introduced to improve inter-modal semantic fusion. High-quality binary hash codes are obtained when the high-dimensional features are mapped into a low-dimensional Hamming space. Comparative experiments on the MIRFLICKR-25K and NUS-WIDE datasets show that the proposed model outperforms current mainstream algorithms under hash codes of different lengths.
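The pipeline the abstract describes can be sketched in PyTorch. This is not the authors' implementation; all module names, dimensions, and pooling choices below are illustrative assumptions. It shows the three ingredients the abstract names: text features querying pixel features through a Transformer decoder, a pixel-text matching score supervised with binary cross-entropy, and continuous codes that become binary hash codes via their sign.

```python
# Hedged sketch (illustrative, not the paper's code): text tokens query
# image pixel tokens via a Transformer decoder; the pixel-text score map
# is pooled into per-label logits supervised by BCE; pooled image features
# are mapped to (-1, 1) codes whose sign yields binary hash codes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedHashHead(nn.Module):
    def __init__(self, dim=512, hash_bits=64):
        super().__init__()
        # Decoder layer: text features are the target, pixels the memory.
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)
        self.hash_fc = nn.Linear(dim, hash_bits)
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, text_feat, pixel_feat, labels):
        # text_feat: (B, T, D) label/text embeddings (e.g., from a CLIP
        # text encoder); pixel_feat: (B, HW, D) pixel-level embeddings.
        text_q = self.decoder(tgt=text_feat, memory=pixel_feat)  # (B, T, D)
        # Pixel-text matching scores: cosine similarity per (label, pixel).
        score_map = torch.einsum('btd,bpd->btp',
                                 F.normalize(text_q, dim=-1),
                                 F.normalize(pixel_feat, dim=-1))
        # Pool over pixels to one logit per label, supervise with BCE.
        logits = score_map.mean(dim=-1)          # (B, T)
        match_loss = self.bce(logits, labels)    # multi-label targets in {0,1}
        # Continuous codes in (-1, 1); sign() at test time gives hash bits.
        img_repr = pixel_feat.mean(dim=1)        # (B, D)
        hash_code = torch.tanh(self.hash_fc(img_repr))
        return hash_code, match_loss
```

In training, `match_loss` would be combined with the usual hashing objectives (similarity preservation and quantization terms); at retrieval time `hash_code.sign()` produces the binary codes compared in Hamming space.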

Key words: Hashing, CLIP, Transformer, Binary cross-entropy, Cross-modal retrieval
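The Hamming-space search mentioned in the abstract is what makes hash retrieval cheap: with ±1 binary codes, distance reduces to a dot product. A minimal illustration (the code layout and function name are assumptions, not from the paper):

```python
# Illustrative only: ranking a database of binary hash codes by Hamming
# distance to a query, as in the abstract's low-dimensional Hamming space.
import numpy as np

def hamming_rank(query_code, db_codes):
    """query_code: (bits,) in {-1,+1}; db_codes: (N, bits) in {-1,+1}.
    Returns database indices sorted by ascending Hamming distance."""
    # For ±1 codes: Hamming distance = (bits - dot product) / 2.
    dists = (db_codes.shape[1] - db_codes @ query_code) // 2
    return np.argsort(dists, kind='stable')

q = np.array([1, -1, 1, -1])
db = np.array([[1, -1, 1, -1],    # distance 0
               [1, 1, 1, -1],     # distance 1
               [-1, 1, -1, 1]])   # distance 4
print(hamming_rank(q, db))        # → [0 1 2]
```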

CLC Number: TP391