Computer Science, 2024, Vol. 51, Issue (9): 207-213. doi: 10.11896/jsjkx.230700212
HUANG Xiaofei, GUO Weibin
Abstract: Dual-encoder models offer faster inference than fusion-encoder models and allow image and text representations to be pre-computed before inference. However, the shallow interaction module used in dual-encoder models is insufficient for complex vision-language understanding tasks. To address this problem, a new multimodal fusion method is proposed. First, a pre-interactive bridge tower structure (PBTS) is proposed, which builds connections between the top layer of each unimodal encoder and every layer of the cross-modal encoder, enabling comprehensive, bottom-up interaction between visual and textual representations at different semantic levels and thereby more effective cross-modal alignment and fusion. Second, to better learn deep image-text interaction, a two-stage cross-modal attention dual-distillation method (TCMDD) is proposed, which takes a fusion-encoder model as the teacher and distills knowledge from the cross-modal attention matrices of both the unimodal encoders and the fusion module during the pre-training and fine-tuning stages. The method is pre-trained on 4 million images and fine-tuned on three public datasets to verify its effectiveness. Experimental results show that the proposed multimodal fusion method achieves superior performance on multiple vision-language understanding tasks.
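To make the bridge-tower idea concrete, the following is a minimal PyTorch sketch of a PBTS-style fusion encoder. The names (BridgeLayer, PBTSEncoder), dimensions, and the residual-injection design are illustrative assumptions, not the paper's implementation; the sketch only shows the structural point that every cross-modal layer receives a bridged copy of the unimodal encoders' top-layer outputs.

```python
import torch
import torch.nn as nn

class BridgeLayer(nn.Module):
    """Bridge connection: injects the unimodal top-layer states into one
    cross-modal encoder layer (illustrative design, not the paper's code)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cross_state, unimodal_top):
        # Residual injection of the (projected) unimodal representations.
        return self.norm(cross_state + self.proj(unimodal_top))

class PBTSEncoder(nn.Module):
    """Cross-modal encoder with a bridge at every fusion layer."""
    def __init__(self, dim=768, num_layers=6, num_heads=12):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
            for _ in range(num_layers))
        self.bridges = nn.ModuleList(
            BridgeLayer(dim) for _ in range(num_layers))

    def forward(self, text_top, image_top):
        # Both the fusion input and the bridge source are built from the
        # unimodal encoders' top-layer outputs.
        bridge_src = torch.cat([text_top, image_top], dim=1)
        h = bridge_src
        for layer, bridge in zip(self.layers, self.bridges):
            h = bridge(layer(h), bridge_src)
        return h

# Usage with dummy top-layer states: 16 text tokens, 49 image patches.
text_top = torch.randn(2, 16, 768)
image_top = torch.randn(2, 49, 768)
fused = PBTSEncoder()(text_top, image_top)   # shape: (2, 65, 768)
```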
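The attention distillation in TCMDD can likewise be sketched as a loss between teacher and student attention maps. The KL-divergence formulation and the alpha weighting below are assumptions in the spirit of MiniLM-style attention transfer; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def attention_distill_loss(student_attn, teacher_attn, eps=1e-8):
    """KL divergence between teacher and student attention distributions.

    Both tensors have shape (batch, heads, query_len, key_len), with rows
    already softmax-normalized.
    """
    s = torch.clamp(student_attn, min=eps)
    t = torch.clamp(teacher_attn, min=eps)
    return F.kl_div(s.log(), t, reduction="batchmean")

def tcmdd_loss(student_uni_attn, teacher_uni_attn,
               student_fusion_attn, teacher_fusion_attn, alpha=0.5):
    # "Dual" distillation: the fusion-encoder teacher supervises the
    # cross-modal attention of both the unimodal encoders and the fusion
    # module; alpha is a hypothetical weighting, not from the paper.
    loss_uni = attention_distill_loss(student_uni_attn, teacher_uni_attn)
    loss_fusion = attention_distill_loss(student_fusion_attn,
                                         teacher_fusion_attn)
    return alpha * loss_uni + (1.0 - alpha) * loss_fusion
```

In the two-stage scheme described in the abstract, a loss of this form would be added to the task objective during both pre-training and fine-tuning.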