Computer Science ›› 2024, Vol. 51 ›› Issue (11A): 231200185-5. doi: 10.11896/jsjkx.231200185

• Image Processing & Multimedia Technology •

Multimodal Contrastive Learning Based Scene Graph Generation

ZHU Xudong, LAI Teng   

  College of Information and Control Engineering, Xi'an University of Architecture and Technology, Xi'an 710055, China
  • Online: 2024-11-16  Published: 2024-11-13
  • About author: ZHU Xudong, born in 1973, Ph.D, associate professor, master supervisor. His main research interests include privacy preservation and scene graph generation.
    LAI Teng, born in 1998, postgraduate. His main research interests include scene graph generation and deep learning.
  • Supported by: National Key Research and Development Program of China (2019YFD1100901).

Abstract: Scene graph generation (SGG) methods play a pivotal role in analyzing objects and their relationships within images, with widespread applications in visual understanding and image retrieval. However, existing SGG methods rely only on visual features or on individual visual concepts such as objects, which limits the accuracy of relationship recognition and requires a substantial amount of manual annotation. To address these issues, this paper integrates image and text features and proposes a multimodal contrastive learning based scene graph generation method, multimodal contrastive learning for scene graph (MCL-SG). The method first extracts features from both the image and text inputs, then employs a Transformer Encoder to encode and fuse the resulting feature vectors, allowing information from the two modalities to complement each other. Notably, MCL-SG incorporates a self-supervised contrastive learning strategy that computes the similarity between image and text features: training pulls matched (positive) image-text pairs together while pushing mismatched (negative) pairs apart, eliminating the need for extensive manual annotation. Experiments are conducted on the Visual Genome (VG) dataset, a large public benchmark for scene graph generation, across three hierarchical subtasks: SGDet, SGCls, and PredCls. Under the mean Recall@100 metric, MCL-SG improves scene graph detection by 9.8%, scene graph classification by 14.0%, and predicate classification by 8.9%, demonstrating the effectiveness of MCL-SG.
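
The pipeline sketched in the abstract, a Transformer Encoder that fuses image and text features followed by an image-text contrastive objective, can be illustrated with a short CLIP-style InfoNCE example. The code below is a minimal sketch under stated assumptions, not the authors' implementation: the FusionEncoder module, the 256-dimensional features, the mean-pooling of each modality, and the temperature of 0.07 are hypothetical choices made only for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionEncoder(nn.Module):
    """Hypothetical fusion step: a Transformer Encoder over concatenated
    image-region and text-token features, mean-pooled back per modality."""
    def __init__(self, dim: int = 256, heads: int = 4, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # img_tokens: (B, N_img, dim); txt_tokens: (B, N_txt, dim)
        fused = self.encoder(torch.cat([img_tokens, txt_tokens], dim=1))
        n_img = img_tokens.size(1)
        img_feat = fused[:, :n_img].mean(dim=1)   # pooled image representation
        txt_feat = fused[:, n_img:].mean(dim=1)   # pooled text representation
        return img_feat, txt_feat

def image_text_contrastive_loss(img_feat: torch.Tensor,
                                txt_feat: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style objective: matched image-text pairs in a batch are
    positives; every other pairing in the batch serves as a negative."""
    img = F.normalize(img_feat, dim=-1)                     # unit vectors -> cosine similarity
    txt = F.normalize(txt_feat, dim=-1)
    logits = img @ txt.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)  # diagonal entries are positives
    loss_i2t = F.cross_entropy(logits, targets)             # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)         # text -> image direction
    return (loss_i2t + loss_t2i) / 2

if __name__ == "__main__":
    # random tensors stand in for detector region features and text-encoder token embeddings
    img_tokens = torch.randn(8, 36, 256)   # e.g. 36 region features per image
    txt_tokens = torch.randn(8, 20, 256)   # e.g. 20 token embeddings per caption
    model = FusionEncoder()
    img_feat, txt_feat = model(img_tokens, txt_tokens)
    print(image_text_contrastive_loss(img_feat, txt_feat).item())

Because only matched pairs sit on the diagonal of the similarity matrix, minimizing the symmetric cross-entropy drives positive pairs together and negatives apart without any relation-level manual labels, which is the self-supervised property the abstract emphasizes.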

Key words: Scene graph generation, Transformer model, Multimodal, Contrastive learning, Object detection

CLC Number: TP391