Computer Science ›› 2024, Vol. 51 ›› Issue (11A): 231200185-5. doi: 10.11896/jsjkx.231200185

• Image Processing & Multimedia Technology •

Multimodal Contrastive Learning Based Scene Graph Generation

ZHU Xudong, LAI Teng

  1. College of Information and Control Engineering, Xi'an University of Architecture and Technology, Xi'an 710055, China
  • Online: 2024-11-16  Published: 2024-11-13
  • Corresponding author: LAI Teng (15183874170@163.com)
  • About author: ZHU Xudong, born in 1973, Ph.D, associate professor, master supervisor (zhudongxu@vip.sina.com). His main research interests include privacy preservation and scene graph generation.
    LAI Teng, born in 1998, postgraduate. His main research interests include scene graph generation and deep learning.
  • Supported by: National Key Research and Development Program of China (2019YFD1100901).


Abstract: Scene graph generation (SGG) methods study objects and their relationships within images, with widespread applications in visual understanding and image retrieval. However, existing SGG methods are limited to visual features or individual visual concepts such as objects, resulting in low relationship-recognition accuracy and requiring a substantial amount of manual annotation. To address these issues, this paper fuses image and text features and proposes a multimodal contrastive learning based scene graph generation method, MCL-SG (Multimodal Contrastive Learning for Scene Graph). The method first extracts features from the image and text inputs; a Transformer Encoder then encodes and fuses the resulting feature vectors, integrating information from both modalities. Finally, MCL-SG adopts a self-supervised contrastive learning strategy: it computes the similarity between image and text features and trains by pulling matched (positive) image-text pairs together while pushing mismatched (negative) pairs apart, eliminating the need for extensive manual annotation. Experiments are conducted on VG (Visual Genome), a large public scene graph generation dataset, across three hierarchical subtasks: SGDet, SGCls, and PredCls. On the mean Recall@100 metric, MCL-SG improves scene graph detection by 9.8%, scene graph classification by 14.0%, and relationship classification by 8.9%, demonstrating its effectiveness.
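To make the training objective described above concrete, the following is a minimal sketch of the pipeline as the abstract presents it: image tokens and text tokens are fused by a Transformer Encoder, and a CLIP-style symmetric contrastive (InfoNCE) loss treats matched image-text pairs within a batch as positives and all other pairings as negatives. All class names, dimensions, and pooling choices here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an MCL-SG-style objective: Transformer fusion of
# image/text tokens followed by a symmetric contrastive (InfoNCE) loss.
# Module names, dimensions, and mean-pooling are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalContrastiveSketch(nn.Module):
    def __init__(self, dim=512, num_layers=4, num_heads=8, temperature=0.07):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.temperature = temperature

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (B, Ni, dim) region features, e.g. from a detector
        # text_tokens:  (B, Nt, dim) token embeddings from a text encoder
        fused = self.encoder(torch.cat([image_tokens, text_tokens], dim=1))
        # Pool each modality's fused tokens into one vector per sample.
        n_img = image_tokens.size(1)
        img_emb = F.normalize(fused[:, :n_img].mean(dim=1), dim=-1)
        txt_emb = F.normalize(fused[:, n_img:].mean(dim=1), dim=-1)
        # Similarity matrix: diagonal entries are matched (positive) pairs;
        # all off-diagonal pairs in the batch act as negatives.
        logits = img_emb @ txt_emb.t() / self.temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE: image-to-text plus text-to-image cross-entropy.
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.t(), targets)) / 2
```

Under these assumptions, calling the module on a matched batch, e.g. `MultimodalContrastiveSketch()(torch.randn(8, 36, 512), torch.randn(8, 20, 512))`, yields a scalar loss; no relation labels are needed, which is how a contrastive objective of this kind avoids manual annotation.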

Key words: Scene graph generation, Transformer model, Multimodal, Contrastive learning, Object detection

CLC Number:

  • TP391