Computer Science ›› 2024, Vol. 51 ›› Issue (9): 207-213. doi: 10.11896/jsjkx.230700212

• Artificial Intelligence •

Multi-modal Fusion Method Based on Dual Encoders

HUANG Xiaofei, GUO Weibin   

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received: 2023-07-28  Revised: 2023-11-27  Online: 2024-09-15  Published: 2024-09-10
  • About author: HUANG Xiaofei, born in 1994, postgraduate. His main research interests include knowledge distillation and multi-modal learning.
    GUO Weibin, born in 1968, Ph.D., professor. His main research interests include high-performance computing, computer applications and software engineering.
  • Supported by:
    Research on Classification Learning of Geometric Information Fusion (62076094).

Abstract: Dual-encoder models offer faster inference than fusion-encoder models because image and text representations can be pre-computed before inference. However, the shallow interaction module used in dual-encoder models is not sufficient for complex vision-language understanding tasks. To address this issue, this paper proposes a new multi-modal fusion method. First, a pre-interactive bridge tower structure (PBTS) is proposed to establish connections between the top layer of each unimodal encoder and every layer of the cross-modal encoder. This enables comprehensive bottom-up interaction between visual and textual representations at different semantic levels, allowing more effective cross-modal alignment and fusion. Second, to better learn the deep interaction between images and text, a two-stage cross-modal attention double distillation method (TCMDD) is proposed, which takes a fusion-encoder model as the teacher and simultaneously distills the cross-modal attention matrices of the unimodal encoders and the fusion module in both the pre-training and fine-tuning stages. The method is pre-trained on 4 million images and fine-tuned on three public datasets to validate its effectiveness. Experimental results show that the proposed multi-modal fusion method achieves better performance on multiple vision-language understanding tasks.
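To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of (i) a bridge-style cross-modal layer that re-injects the top-layer unimodal text states before cross-attending to pre-computed image features, loosely in the spirit of PBTS, and (ii) a KL-divergence loss between teacher and student cross-modal attention matrices, as one plausible form of the attention distillation in TCMDD. All module names, dimensions, the exact bridging rule, and the loss formulation are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BridgeCrossModalLayer(nn.Module):
    """One cross-modal layer: text queries attend to image keys/values,
    with a 'bridge' that mixes in the top-layer unimodal text states."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.bridge = nn.Linear(dim, dim)   # projects the top unimodal states
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, text, image, text_top):
        # Bridge connection: add the projected top-layer unimodal text states
        # to the current hidden states before cross-attention (illustrative rule).
        h = self.norm(text + self.bridge(text_top))
        fused, attn = self.cross_attn(h, image, image, need_weights=True,
                                      average_attn_weights=False)
        out = h + fused
        out = out + self.ffn(out)
        # attn: [batch, heads, text_len, image_len], softmax-normalized per row
        return out, attn


def attention_distillation_loss(student_attn, teacher_attn, eps=1e-8):
    """KL divergence from the teacher's to the student's cross-modal attention
    distributions (both already normalized over the key positions)."""
    return F.kl_div((student_attn + eps).log(), teacher_attn,
                    reduction="batchmean")


if __name__ == "__main__":
    B, Lt, Li, D = 2, 16, 36, 256
    text = torch.randn(B, Lt, D)       # current text hidden states (student)
    text_top = torch.randn(B, Lt, D)   # top-layer unimodal text states
    image = torch.randn(B, Li, D)      # pre-computed image features
    layer = BridgeCrossModalLayer(D)
    fused, s_attn = layer(text, image, text_top)
    # Stand-in for the teacher (fusion-encoder) attention probabilities.
    t_attn = torch.softmax(torch.randn_like(s_attn), dim=-1)
    loss = attention_distillation_loss(s_attn, t_attn)
    print(fused.shape, loss.item())
```

In an actual two-stage setup, a loss of this kind would be added to the pre-training objectives and again during fine-tuning, with the teacher attention taken from a trained fusion-encoder model; the sketch only shows the per-layer computation.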

Key words: Multi-modal fusion, Dual encoder, Cross-modal attention distillation, Bridge tower structure

CLC Number: TP391.4