Computer Science ›› 2024, Vol. 51 ›› Issue (9): 207-213. doi: 10.11896/jsjkx.230700212

• Artificial Intelligence •


Multi-modal Fusion Method Based on Dual Encoders

HUANG Xiaofei, GUO Weibin   

  1. School of Information Science and Engineering,East China University of Science and Technology,Shanghai 200237,China
  • Received:2023-07-28 Revised:2023-11-27 Online:2024-09-15 Published:2024-09-10
  • Corresponding author: GUO Weibin(gweibin@ecust.edu.cn)
  • About author:HUANG Xiaofei,born in 1994,postgraduate(y30211028@mail.ecust.edu.cn).His main research interests include knowledge distillation and multi-modality.
    GUO Weibin,born in 1968,Ph.D,professor.His main research interests include high performance computing,computer applications and software engineering.
  • Supported by:
    Research on Classification Learning of Geometric Information Fusion(62076094).


Abstract: The dual-encoder model offers faster inference than the fusion-encoder model and allows image and text representations to be pre-computed for inference. However, the shallow interaction module used in dual-encoder models is not sufficient for complex vision-language understanding tasks. To address this issue, this paper proposes a new multi-modal fusion method. First, a pre-interactive bridge-tower structure (PBTS) is proposed to build connections between the top layer of each uni-modal encoder and every layer of the cross-modal encoder, enabling comprehensive, bottom-up interaction between visual and textual representations at different semantic levels and thus more effective cross-modal alignment and fusion. Second, to better learn the deep interaction between images and text, a two-stage cross-modal attention double-distillation method (TCMDD) is proposed, which takes a fusion-encoder model as the teacher and distills the cross-modal attention matrices of the uni-modal encoders and the fusion module in both the pre-training and fine-tuning stages. The method is pre-trained on 4 million images and fine-tuned on three public datasets to validate its effectiveness. Experimental results show that the proposed multi-modal fusion method achieves better performance on multiple vision-language understanding tasks.
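The abstract describes the two components only at a high level. As a rough illustration, the minimal PyTorch-style sketch below shows (a) a bridge-like module that re-injects the top-layer uni-modal representations into every cross-modal fusion layer, and (b) a KL-divergence loss between teacher and student cross-modal attention maps. All module names, layer counts, dimensions, and the exact form of the distillation loss are assumptions made for illustration; this is not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PreInteractiveBridge(nn.Module):
    # Hypothetical sketch: re-inject top-layer uni-modal features ("bridges")
    # before every cross-modal fusion layer, so fusion at each depth can see
    # the full uni-modal representations.
    def __init__(self, dim=768, num_fusion_layers=6, num_heads=12):
        super().__init__()
        self.fusion_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
             for _ in range(num_fusion_layers)])
        self.bridges = nn.ModuleList(
            [nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))
             for _ in range(num_fusion_layers)])

    def forward(self, text_top, image_top):
        # text_top, image_top: (batch, seq_len, dim) top-layer outputs of the
        # uni-modal text and image encoders.
        uni = torch.cat([text_top, image_top], dim=1)
        x = uni
        for layer, bridge in zip(self.fusion_layers, self.bridges):
            x = layer(x + bridge(uni))  # bridge connection at every fusion layer
        return x

def attention_distillation_loss(student_attn, teacher_attn, eps=1e-8):
    # KL divergence between the teacher's and the student's cross-modal
    # attention maps; both tensors are (batch, heads, query_len, key_len)
    # and already softmax-normalised over the last dimension.
    return F.kl_div((student_attn + eps).log(), teacher_attn, reduction="batchmean")

In a full pipeline, such an attention-distillation term would be summed over the selected layers and added to the task loss in both the pre-training and fine-tuning stages, following the two-stage scheme described in the abstract.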

Key words: Multi-modal fusion, Dual encoder, Cross-modal attention distillation, Bridge tower structure

CLC Number: TP391.4
[1]KIM W,SON B,KIM I.ViLT:Vision-and-language transformer without convolution or region supervision[C]//International Conference on Machine Learning.PMLR,2021:5583-5594.
[2]DOSOVITSKIY A,BEYER L,KOLESNIKOV A,et al.An image is worth 16x16 words:Transformers for image recognition at scale[J].arXiv:2010.11929,2020.
[3]RADFORD A,KIM J W,HALLACY C,et al.Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning.PMLR,2021:8748-8763.
[4]JIA C,YANG Y,XIA Y,et al.Scaling up visual and vision-language representation learning with noisy text supervision[C]//International Conference on Machine Learning.PMLR,2021:4904-4916.
[5]ANTOL S,AGRAWAL A,LU J,et al.Vqa:Visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision.2015:2425-2433.
[6]XIE N,LAI F,DORAN D,et al.Visual entailment:A novel task for fine-grained image understanding[J].arXiv:1901.06706,2019.
[7]HINTON G,VINYALS O,DEAN J.Distilling the knowledge in a neural network[J].arXiv:1503.02531,2015.
[8]ROMERO A,BALLAS N,KAHOU S E,et al.FitNets:Hints for thin deep nets[J].arXiv:1412.6550,2014.
[9]ZAGORUYKO S,KOMODAKIS N.Paying more attention to attention:Improving the performance of convolutional neural networks via attention transfer[J].arXiv:1612.03928,2016.
[10]LI D,YANG Y,TANG H,et al.VIRT:Improving Representation-based Text Matching via Virtual Interaction[C]//Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.2022:914-925.
[11]WANG Z,WANG W,ZHU H,et al.Distilled dual-encoder model for vision-language understanding[J].arXiv:2112.08723,2021.
[12]LU Y,LIU Y,LIU J,et al.Ernie-search:Bridging cross-encoder with dual-encoder via self on-the-fly distillation for dense passage retrieval[J].arXiv:2205.09153,2022.
[13]CHEN Y C,LI L,YU L,et al.Uniter:Universal image-text representation learning[C]//European Conference on Computer Vision.Cham:Springer International Publishing,2020:104-120.
[14]CHO J,LEI J,TAN H,et al.Unifying vision-and-language tasks via text generation [C]//International Conference on Machine Learning.PMLR,2021:1931-1942.
[15]GIRSHICK R.Fast r-cnn[C]//Proceedings of the IEEE International Conference on Computer Vision.2015:1440-1448.
[16]HE K,ZHANG X,REN S,et al.Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:770-778.
[17]WANG P,YANG A,MEN R,et al.Ofa:Unifying architectures,tasks,and modalities through a simple sequence-to-sequence learning framework [C]//International Conference on Machine Learning.PMLR,2022:23318-23340.
[18]WANG Z,YU J,YU A W,et al.Simvlm:Simple visual language model pretraining with weak supervision[J].arXiv:2108.10904,2021.
[19]XU X,WU C,ROSENMAN S,et al.BridgeTower:Building bridges between encoders in vision-language representation learning[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2023:10637-10647.
[20]WANG W,WEI F,DONG L,et al.Minilm:Deep self-attention distillation for task-agnostic compression of pre-trained transformers[J].Advances in Neural Information Processing Systems,2020,33:5776-5788.
[21]LIN T Y,MAIRE M,BELONGIE S,et al.Microsoft coco:Common objects in context[C]//Computer Vision-ECCV 2014:13th European Conference,Zurich,Switzerland,Part V 13.Springer International Publishing,2014:740-755.
[22]SHARMA P,DING N,GOODMAN S,et al.Conceptual captions:A cleaned,hypernymed,image alt-text dataset for automatic image captioning[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers).2018:2556-2565.
[23]ORDONEZ V,KULKARNI G,BERG T.Im2text:Describing images using 1 million captioned photographs[C]//Proceedings of the 24th International Conference on Neural Information Processing Systems.2011:1143-1151.
[24]KRISHNA R,ZHU Y,GROTH O,et al.Visual genome:Connecting language and vision using crowdsourced dense image annotations[J].International Journal of Computer Vision,2017,123:32-73.
[25]SUHR A,ZHOU S,ZHANG A,et al.A corpus for reasoning about natural language grounded in photographs[J].arXiv:1811.00491,2018.
[26]XIE N,LAI F,DORAN D,et al.Visual entailment:A novel task for fine-grained image understanding[J].arXiv:1901.06706,2019.
[27]GOYAL Y,KHOT T,SUMMERS-STAY D,et al.Making the v in vqa matter:Elevating the role of image understanding in visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:6904-6913.