Computer Science ›› 2024, Vol. 51 ›› Issue (1): 168-174. doi: 10.11896/jsjkx.230700084

• Computer Graphics & Multimedia •

Multimodal Pre-training Method for Multi-view Contrastive Learning and Semantic Enhancement

TANG Jia1, GUO Yan1,2, YE Mingwei1, WU Guixing1,2   

  1. School of Software Engineering, University of Science and Technology of China, Hefei 230026, China
    2. Suzhou Institute for Advanced Study, University of Science and Technology of China, Suzhou, Jiangsu 215123, China
  • Received: 2023-07-12  Revised: 2023-11-19  Online: 2024-01-15  Published: 2024-01-12
  • About author: TANG Jia, born in 1999, master. His main research interests include multimodal learning and natural language processing.
    GUO Yan, born in 1981, lecturer. Her main research interests include information security, NLP and blockchain.

Abstract: Vision-language pre-training (VLP) models have shown impressive performance on multimodal tasks through contrastive learning and other methods. However, existing research has overlooked the benefits of multi-view descriptions and the importance of semantics and syntax. To address this issue, this paper proposes multi-view learning and semantic enhancement for multimodal pre-training (MulSE), which consists of three main components: 1) introducing multi-view contrastive learning with a generator into the fusion-encoder model; 2) proposing multimodal text reordering as a novel self-supervised vision-language pre-training task; 3) increasing the MLM masking ratio and exploring its optimal value, so as to make fuller use of visual information. By improving the pre-training tasks and employing these strategies, experiments demonstrate that MulSE enhances both intra-modal and inter-modal understanding and improves the comprehension of syntax and semantics within text. With only 4M pre-training images, it matches results previously achieved with much larger datasets on image-text retrieval, and its evaluation results on visual question answering and visual entailment outperform previous understanding-oriented VLP models.
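
To make the first component concrete, the following is a minimal sketch, in PyTorch-style Python, of what a multi-view image-text contrastive objective can look like: each image is pulled toward several textual views of itself (e.g. alternative or generator-produced captions) and pushed away from the texts of other images in the batch. The function name, temperature value and embedding size are assumptions for illustration, not the paper's released implementation.

# Illustrative sketch of a multi-view image-text contrastive objective (InfoNCE).
# All names and hyperparameters here are assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def multi_view_contrastive_loss(img_emb, txt_views, temperature=0.07):
    """img_emb:   (B, D) image embeddings.
       txt_views: list of V tensors, each (B, D); row i of every view describes image i.
       Returns the symmetric InfoNCE loss averaged over the V text views."""
    img_emb = F.normalize(img_emb, dim=-1)
    labels = torch.arange(img_emb.size(0), device=img_emb.device)  # matching pairs lie on the diagonal
    loss = 0.0
    for txt_emb in txt_views:
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix
        # image-to-text and text-to-image directions
        loss = loss + 0.5 * (F.cross_entropy(logits, labels)
                             + F.cross_entropy(logits.t(), labels))
    return loss / len(txt_views)

if __name__ == "__main__":
    B, D = 8, 256
    img = torch.randn(B, D)
    views = [torch.randn(B, D), torch.randn(B, D)]  # two text views per image
    print(multi_view_contrastive_loss(img, views).item())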

Key words: Computer vision, Multimodal, Pre-training, Multi-view, Comprehension augmentation

CLC Number: TP391