Computer Science ›› 2024, Vol. 51 ›› Issue (1): 168-174. doi: 10.11896/jsjkx.230700084

• Computer Graphics & Multimedia •


Multimodal Pre-training Method for Multi-view Contrastive Learning and Semantic Enhancement

TANG Jia1, GUO Yan1,2, YE Mingwei1, WU Guixing1,2   

  1 School of Software Engineering, University of Science and Technology of China, Hefei 230026, China
    2 Suzhou Institute for Advanced Study, University of Science and Technology of China, Suzhou, Jiangsu 215123, China
  • Received: 2023-07-12  Revised: 2023-11-19  Online: 2024-01-15  Published: 2024-01-12
  • Corresponding author: GUO Yan (guoyan@ustc.edu.cn)
  • About author: TANG Jia, born in 1999, master (tangjia_21@163.com). His main research interests include multimodal learning and natural language processing.
    GUO Yan, born in 1981, lecturer. Her main research interests include information security, NLP and blockchain.


Abstract: Visual-language pre-training (VLP) models have shown impressive performance on multimodal tasks through contrastive learning and related methods. However, existing research overlooks the benefits of multi-view descriptions and the importance of semantics and syntax. To address this issue, this paper proposes multi-view learning and semantic enhancement for multimodal pre-training (MulSE), which consists of three main components: 1) introducing multi-view contrastive learning with a generator into the fused-encoder model; 2) proposing multimodal text reordering as a novel self-supervised visual-language pre-training task; 3) increasing and exploring the optimal MLM masking ratio, maximizing the ability to exploit visual information. By improving the pre-training tasks and adopting several optimal strategies, experiments demonstrate that MulSE enhances both intra-modal and inter-modal understanding and improves the comprehension of syntax and semantics within text. With a pre-training data volume of only 4M, it matches the image-text retrieval results previously obtained with much larger datasets, and its evaluation results on visual question answering and visual entailment outperform previous understanding-oriented VLP models.
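To make the multi-view objective concrete, the following is a minimal PyTorch sketch of an InfoNCE-style image-text contrastive loss computed over several text views per image. The function name, tensor shapes, and temperature value are illustrative assumptions for exposition and are not taken from the MulSE implementation.

```python
# A minimal sketch of a multi-view image-text contrastive (InfoNCE) loss in the
# spirit of the objective described in the abstract. All identifiers here
# (multi_view_contrastive_loss, image_emb, text_views) are illustrative and are
# not taken from the MulSE code base.
import torch
import torch.nn.functional as F

def multi_view_contrastive_loss(image_emb: torch.Tensor,
                                text_views: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """image_emb: [B, D]; text_views: [B, V, D], i.e. V text views per image
    (for example, the original caption plus generator-produced rewrites)."""
    B, V, D = text_views.shape
    img = F.normalize(image_emb, dim=-1)                     # [B, D]
    txt = F.normalize(text_views, dim=-1).reshape(B * V, D)  # [B*V, D]

    logits = img @ txt.t() / temperature                     # [B, B*V]
    targets = torch.arange(B, device=img.device)

    # Image -> text: for view v, the positive of image b sits at column b*V + v.
    loss_i2t = sum(
        F.cross_entropy(logits[:, targets * V + v], targets) for v in range(V)
    ) / V

    # Text -> image: every text view of image b should retrieve image b.
    loss_t2i = F.cross_entropy(logits.t(), targets.repeat_interleave(V))

    return 0.5 * (loss_i2t + loss_t2i)

# Example with random features: a batch of 8 images, 3 text views each.
# loss = multi_view_contrastive_loss(torch.randn(8, 256), torch.randn(8, 3, 256))
```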

Key words: Computer vision, Multimodal, Pre-training, Multi-view, Comprehension enhancement

CLC Number: TP391