Computer Science ›› 2022, Vol. 49 ›› Issue (9): 123-131. doi: 10.11896/jsjkx.220600011

• Computer Graphics & Multimedia •


Fine-grained Semantic Reasoning Based Cross-media Dual-way Adversarial Hashing Learning Model

CAO Xiao-wen, LIANG Mei-yu, LU Kang-kang   

  1. Beijing Key Laboratory of Intelligent Communication Software and Multimedia, School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received: 2022-06-02 Revised: 2022-07-05 Online: 2022-09-15 Published: 2022-09-09
  • Corresponding author: LIANG Mei-yu (meiyu1210@bupt.edu.cn)
  • About author: CAO Xiao-wen (xwcao@bupt.edu.cn), born in 1998, master. Her main research interests include deep learning and cross-modal retrieval.
    LIANG Mei-yu, born in 1985, associate professor, master supervisor. Her main research interests include artificial intelligence, data mining, multimedia information processing and computer vision.
  • Supported by:
    National Natural Science Foundation of China (61877006, 62192784) and CAAI-Huawei MindSpore Open Fund (CAAIXSJLJJ-2021-007B).


Abstract: Cross-media hashing has received extensive attention in cross-media search tasks due to its superior search efficiency and low storage cost. However, existing methods cannot adequately preserve the high-level semantic relevance and multi-label semantic information of multi-modal data, which degrades the quality of the learned hash codes. To solve this problem, this paper proposes a fine-grained semantic reasoning based cross-media dual-way adversarial hashing learning model (SDAH), which generates compact and consistent cross-media unified efficient hash semantic representations by maximally mining the fine-grained semantic associations between different modalities. First, a fine-grained cross-media semantic association learning and reasoning method based on a cross-media collaborative attention mechanism is proposed: the attention mechanism collaboratively learns the fine-grained latent semantic associations between images and texts and yields the salient semantic reasoning features of both. Then, a cross-media dual-way adversarial hashing network is established, which jointly learns the intra-modality and inter-modality semantic similarity constraints and better aligns the semantic distributions of the hash codes of different modalities through a dual-way adversarial learning mechanism. This produces higher-quality and more discriminative unified cross-media hash representations, facilitates cross-media semantic fusion, and improves cross-media search performance. Experimental comparisons with existing methods on two public datasets verify the superior performance of the proposed method in various cross-media search scenarios.
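
To make the first stage concrete, below is a minimal PyTorch sketch of one cross-media co-attention step, in which image region features (e.g., from a detector such as Faster R-CNN) attend over text token features (e.g., from BERT) to produce text-aware salient image features. The dimensions, layer choices, and mean pooling are illustrative assumptions, not the authors' exact architecture; a symmetric branch attending from text to image would complete the collaborative scheme.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossMediaCoAttention(nn.Module):
    # One direction of a co-attention step: image regions query text tokens.
    def __init__(self, dim=512):
        super().__init__()
        self.q_img = nn.Linear(dim, dim)  # project image regions to queries
        self.k_txt = nn.Linear(dim, dim)  # project text tokens to keys
        self.v_txt = nn.Linear(dim, dim)  # project text tokens to values
        self.scale = dim ** -0.5

    def forward(self, img_regions, txt_tokens):
        # img_regions: (batch, n_regions, dim); txt_tokens: (batch, n_tokens, dim)
        q = self.q_img(img_regions)
        k = self.k_txt(txt_tokens)
        v = self.v_txt(txt_tokens)
        # each image region attends over all text tokens
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        txt_aware = attn @ v  # text-attended image features, (batch, n_regions, dim)
        # pool regions into one salient semantic feature per image
        return txt_aware.mean(dim=1)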
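
The dual-way adversarial alignment of the second stage can be sketched similarly: a discriminator learns to tell image hash codes from text hash codes, while both hash encoders are trained to fool it in opposite directions, pulling the two code distributions together. The network sizes, the tanh relaxation of binary codes, and the equal loss weighting below are assumptions; the intra- and inter-modality similarity constraints of the full model are omitted.

import torch
import torch.nn as nn

HASH_BITS = 64

# tanh keeps relaxed codes in (-1, 1); sign() would binarize them at retrieval time
img_encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, HASH_BITS), nn.Tanh())
txt_encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, HASH_BITS), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(HASH_BITS, 64), nn.ReLU(), nn.Linear(64, 1))

bce = nn.BCEWithLogitsLoss()

def adversarial_losses(img_feat, txt_feat):
    h_img, h_txt = img_encoder(img_feat), txt_encoder(txt_feat)
    ones = torch.ones(h_img.size(0), 1)
    zeros = torch.zeros(h_txt.size(0), 1)
    # discriminator step: label image codes 1 and text codes 0 (codes detached)
    d_loss = bce(discriminator(h_img.detach()), ones) + bce(discriminator(h_txt.detach()), zeros)
    # encoder step: each modality tries to be classified as the other, in both directions
    g_loss = bce(discriminator(h_img), zeros) + bce(discriminator(h_txt), ones)
    return d_loss, g_loss

Alternating updates of d_loss (discriminator only) and g_loss (encoders only), together with similarity-preserving losses on the codes, would follow the usual GAN training loop.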

Key words: Semantic reasoning, Hash learning, Cross-media search, Adversarial learning, Cross-media semantic fusion

CLC number: TP391