Computer Science ›› 2024, Vol. 51 ›› Issue (1): 168-174. doi: 10.11896/jsjkx.230700084

• Computer Graphics & Multimedia •

Multimodal Pre-training Method for Multi-view Contrastive Learning and Semantic Enhancement

TANG Jia1, GUO Yan1,2, YE Mingwei1, WU Guixing1,2   

  1. School of Software Engineering, University of Science and Technology of China, Hefei 230026, China
  2. Suzhou Institute for Advanced Study, University of Science and Technology of China, Suzhou, Jiangsu 215123, China
  • Received: 2023-07-12  Revised: 2023-11-19  Online: 2024-01-15  Published: 2024-01-12
  • About author: TANG Jia, born in 1999, master. His main research interests include multimodal learning and natural language processing.
    GUO Yan, born in 1981, lecturer. Her main research interests include information security, NLP and blockchain.

Abstract: Visual-language pre-training (VLP) models have shown impressive performance on multimodal tasks through contrastive learning and related methods. However, existing research has overlooked the benefits of multi-view descriptions, as well as the importance of semantics and grammar. To address this issue, this paper proposes multi-view learning and semantic enhancement for multimodal pre-training (MulSE), which consists of three main components: 1) introducing multi-view contrastive learning with a generator into the fused-encoder model; 2) proposing multimodal text reordering as a novel self-supervised visual-language pre-training task; 3) raising the MLM masking ratio and exploring its optimal value, maximizing the model's ability to use visual information. By improving the pre-training tasks and employing several optimal strategies, our experiments demonstrate that MulSE enhances both intra-modal and inter-modal understanding and improves the comprehension of syntax and semantics within text. With only 4M pre-training images, it achieves results on the image-text retrieval task that previously required much larger datasets, and its evaluation results on the visual question answering and visual entailment tasks outperform previous understanding-oriented VLP models.
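To make these components concrete, the following is a minimal PyTorch sketch of two of them: a multi-view image-text contrastive (InfoNCE) loss averaged over several text views of the same image, and an MLM mask with a configurable, deliberately higher masking ratio. The function names, tensor shapes, 0.07 temperature, and 0.3 ratio are illustrative assumptions for exposition, not MulSE's actual implementation; the text-reordering task is not shown.

import torch
import torch.nn.functional as F

def info_nce(img, txt, temperature=0.07):
    # Symmetric InfoNCE between L2-normalized image and text embeddings;
    # matching pairs sit on the diagonal of the similarity matrix.
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def multi_view_contrastive(img_emb, text_views):
    # Average the pairwise loss over every text view of the same image,
    # e.g. the original caption plus a generator-rewritten caption.
    return torch.stack([info_nce(img_emb, v) for v in text_views]).mean()

def mlm_mask(token_ids, mask_token_id, ratio=0.3):
    # Replace `ratio` of the tokens with [MASK]; labels are -100 (ignored
    # by cross-entropy) everywhere except the masked positions.
    mask = torch.rand(token_ids.shape) < ratio
    labels = token_ids.masked_fill(~mask, -100)
    return token_ids.masked_fill(mask, mask_token_id), labels

# Toy usage with random tensors standing in for encoder outputs.
img_emb = torch.randn(8, 256)
views = [torch.randn(8, 256), torch.randn(8, 256)]
loss = multi_view_contrastive(img_emb, views)
masked, labels = mlm_mask(torch.randint(5, 1000, (8, 32)), mask_token_id=103)
print(loss.item(), masked.shape, labels.shape)

In the paper's setting, the extra text views come from a generator inside the fused-encoder model rather than from independent random tensors, and the best masking ratio is determined empirically.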

Key words: Computer vision, Multimodal, Pre-training, Multi-view, Comprehension enhancement

CLC Number: TP391
[1] LU J,BATRA D,PARIKH D,et al.ViLBERT:pretraining task-agnostic visiolinguistic representations for vision-and-language tasks[C]//33rd Annual Conference on Neural Information Processing Systems(NeurIPS).2019:1-11.
[2] SU W J,ZHU X Z,CAO Y,et al.VL-BERT:pre-training of generic visual-linguistic representations[C]//8th International Conference on Learning Representations(ICLR).2020.
[3] LI X J,YIN X,LI C Y,et al.Oscar:object-semantics aligned pre-training for vision-language tasks[C]//Computer Vision-ECCV 2020:16th European Conference(ECCV).2020:121-137.
[4] YAO L,HUANG R,HOU L,et al.FILIP:fine-grained interactive language-image pre-training[C]//10th International Conference on Learning Representations(ICLR).2022.
[5] RADFORD A,KIM J W,HALLACY C,et al.Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning(ICML).2021:8748-8763.
[6] LI J N,SELVARAJU R R,GOTMARE A D,et al.Align before fuse:vision and language representation learning with momentum distillation[C]//35th Annual Conference on Neural Information Processing Systems(NeurIPS).2021:9694-9705.
[7] LI J N,LI D X,SAVARESE S,et al.BLIP-2:bootstrapping language-image pre-training with frozen image encoders and large language models[J].arXiv:2301.12597,2023.
[8] HUANG S H,DONG L,WANG W H,et al.Language is not all you need:aligning perception with language models[J].arXiv:2302.14045,2023.
[9] VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//31st Annual Conference on Neural Information Processing Systems(NeurIPS).2017:5999-6009.
[10] BAO H B,WANG W H,DONG L,et al.VLMo:unified vision-language pre-training with mixture-of-modality-experts[C]//36th Annual Conference on Neural Information Processing Systems(NeurIPS).2022:32897-32912.
[11] JIA C,YANG Y F,XIA Y,et al.Scaling up visual and vision-language representation learning with noisy text supervision[C]//International Conference on Machine Learning(ICML).2021:4904-4916.
[12] YU J,WANG Z,VASUDEVAN V,et al.CoCa:contrastive captioners are image-text foundation models[J].arXiv:2205.01917,2022.
[13] SINGH A,HU R H,GOSWAMI V,et al.FLAVA:a foundational language and vision alignment model[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR).2022:15638-15650.
[14] TIAN Y L,KRISHNAN D,ISOLA P.Contrastive multiview coding[C]//Computer Vision-ECCV 2020:16th European Conference(ECCV).2020:776-794.
[15] SHAN B,YIN W C,SUN Y,et al.ERNIE-ViL 2.0:multi-view contrastive learning for image-text pre-training[J].arXiv:2209.15270,2022.
[16] GAO T Y,YAO X C,CHEN D Q.SimCSE:simple contrastive learning of sentence embeddings[C]//2021 Conference on Empirical Methods in Natural Language Processing(EMNLP).2021:6894-6910.
[17] CLARK K,LUONG M T,LE Q V,et al.ELECTRA:pre-training text encoders as discriminators rather than generators[C]//8th International Conference on Learning Representations(ICLR).2020.
[18] ANDERSON P,HE X D,BUEHLER C,et al.Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR).2018:6077-6086.
[19] LIANG X N,CUI C H,WU S Z,et al.Modeling paragraph-level vision-language semantic alignment for multi-modal summarization[J].arXiv:2208.11303,2022.
[20] DOSOVITSKIY A,BEYER L,KOLESNIKOV A,et al.An image is worth 16x16 words:transformers for image recognition at scale[C]//9th International Conference on Learning Representations(ICLR).2021.
[21] DEVLIN J,CHANG M W,LEE K,et al.BERT:pre-training of deep bidirectional transformers for language understanding[C]//2019 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies(NAACL-HLT).2019:4171-4186.
[22] LI X,LIU X P,LI W C,et al.Survey on contrastive learning research[J].Journal of Chinese Computer Systems,2023,44(4):787-797.
[23] TAN H,BANSAL M.LXMERT:learning cross-modality encoder representations from transformers[C]//2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing(EMNLP-IJCNLP).2019:5100-5111.
[24] SELVARAJU R R,COGSWELL M,DAS A,et al.Grad-CAM:visual explanations from deep networks via gradient-based localization[C]//16th IEEE International Conference on Computer Vision(ICCV).2017:618-626.
[25] BYUN J,HWANG T,FU J L,et al.GRIT-VLP:grouped mini-batch sampling for efficient vision and language pre-training[C]//17th European Conference on Computer Vision(ECCV).2022:395-412.