Computer Science ›› 2024, Vol. 51 ›› Issue (11A): 231200185-5. doi: 10.11896/jsjkx.231200185
ZHU Xudong (朱旭东), LAI Teng (赖腾)
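Abstract: Scene graph generation (SGG) studies the entities in an image and the relationships between them, and is widely used in fields such as visual understanding and image retrieval. Existing SGG methods rely on visual features alone or on a single visual concept, which leads to low relationship-recognition accuracy and requires large amounts of manual annotation. To address these problems, this paper fuses image and text features and proposes MCL-SG (Multimodal Contrastive Learning for Scene Graph), a scene graph generation method based on multimodal contrastive learning. First, features are extracted from the image and text inputs to obtain image and text features. Next, a Transformer encoder encodes and fuses the feature vectors. Finally, a self-supervised contrastive learning strategy computes the similarity between image and text features and trains the model by contrasting the similarity of positive samples against that of negative samples, with no manual annotation required. Experiments on three subtasks at different levels (SGDet, SGCls, and PredCls) of the large public scene graph dataset Visual Genome (VG) show that, on the mean Recall@100 metric, MCL-SG improves scene graph detection accuracy by 9.8%, scene graph classification accuracy by 14.0%, and relationship (predicate) classification accuracy by 8.9%, demonstrating the effectiveness of MCL-SG.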
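The abstract does not give the exact training objective, so the following is only a minimal sketch of one common way to realize image-text contrastive learning: a symmetric InfoNCE-style loss over cosine similarities. The function names, the toy feature vectors, and the temperature value of 0.07 are all assumptions made for illustration and are not taken from MCL-SG.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity_matrix(image_feats, text_feats):
    """Entry [i][j] is the cosine similarity of image i and text j."""
    images = [l2_normalize(v) for v in image_feats]
    texts = [l2_normalize(v) for v in text_feats]
    return [[sum(a * b for a, b in zip(i, t)) for t in texts] for i in images]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)

def symmetric_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Hypothetical InfoNCE-style loss: pair (k, k) is positive, all others negative.

    The temperature default is an assumed value, not one reported by the paper.
    """
    sims = cosine_similarity_matrix(image_feats, text_feats)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch: two images, with texts that either match or are swapped.
image_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]
swapped_text = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_contrastive_loss(image_feats, aligned_text)
loss_swapped = symmetric_contrastive_loss(image_feats, swapped_text)
print(loss_aligned < loss_swapped)
```

In this sketch the matched image-text pairs in a batch act as positives and every other pairing acts as a negative, so no manual relationship labels are needed; the loss is low when each image is most similar to its own text and high when the pairing is wrong.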