Computer Science ›› 2024, Vol. 51 ›› Issue (9): 207-213. doi: 10.11896/jsjkx.230700212
• Artificial Intelligence •
HUANG Xiaofei, GUO Weibin
References
[1] KIM W,SON B,KIM I.ViLT:Vision-and-language transformer without convolution or region supervision[C]//International Conference on Machine Learning.PMLR,2021:5583-5594.
[2] DOSOVITSKIY A,BEYER L,KOLESNIKOV A,et al.An image is worth 16x16 words:Transformers for image recognition at scale[J].arXiv:2010.11929,2020.
[3] RADFORD A,KIM J W,HALLACY C,et al.Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning.PMLR,2021:8748-8763.
[4] JIA C,YANG Y,XIA Y,et al.Scaling up visual and vision-language representation learning with noisy text supervision[C]//International Conference on Machine Learning.PMLR,2021:4904-4916.
[5] ANTOL S,AGRAWAL A,LU J,et al.VQA:Visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision.2015:2425-2433.
[6] XIE N,LAI F,DORAN D,et al.Visual entailment:A novel task for fine-grained image understanding[J].arXiv:1901.06706,2019.
[7] HINTON G,VINYALS O,DEAN J.Distilling the knowledge in a neural network[J].arXiv:1503.02531,2015.
[8] ROMERO A,BALLAS N,KAHOU S E,et al.FitNets:Hints for thin deep nets[J].arXiv:1412.6550,2014.
[9] ZAGORUYKO S,KOMODAKIS N.Paying more attention to attention:Improving the performance of convolutional neural networks via attention transfer[J].arXiv:1612.03928,2016.
[10] LI D,YANG Y,TANG H,et al.VIRT:Improving representation-based text matching via virtual interaction[C]//Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.2022:914-925.
[11] WANG Z,WANG W,ZHU H,et al.Distilled dual-encoder model for vision-language understanding[J].arXiv:2112.08723,2021.
[12] LU Y,LIU Y,LIU J,et al.ERNIE-Search:Bridging cross-encoder with dual-encoder via self on-the-fly distillation for dense passage retrieval[J].arXiv:2205.09153,2022.
[13] CHEN Y C,LI L,YU L,et al.UNITER:Universal image-text representation learning[C]//European Conference on Computer Vision.Cham:Springer International Publishing,2020:104-120.
[14] CHO J,LEI J,TAN H,et al.Unifying vision-and-language tasks via text generation[C]//International Conference on Machine Learning.PMLR,2021:1931-1942.
[15] GIRSHICK R.Fast R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision.2015:1440-1448.
[16] HE K,ZHANG X,REN S,et al.Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:770-778.
[17] WANG P,YANG A,MEN R,et al.OFA:Unifying architectures,tasks,and modalities through a simple sequence-to-sequence learning framework[C]//International Conference on Machine Learning.PMLR,2022:23318-23340.
[18] WANG Z,YU J,YU A W,et al.SimVLM:Simple visual language model pretraining with weak supervision[J].arXiv:2108.10904,2021.
[19] XU X,WU C,ROSENMAN S,et al.BridgeTower:Building bridges between encoders in vision-language representation learning[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2023:10637-10647.
[20] WANG W,WEI F,DONG L,et al.MiniLM:Deep self-attention distillation for task-agnostic compression of pre-trained transformers[J].Advances in Neural Information Processing Systems,2020,33:5776-5788.
[21] LIN T Y,MAIRE M,BELONGIE S,et al.Microsoft COCO:Common objects in context[C]//Computer Vision-ECCV 2014:13th European Conference,Zurich,Switzerland,Part V.Springer International Publishing,2014:740-755.
[22] SHARMA P,DING N,GOODMAN S,et al.Conceptual captions:A cleaned,hypernymed,image alt-text dataset for automatic image captioning[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers).2018:2556-2565.
[23] ORDONEZ V,KULKARNI G,BERG T.Im2Text:Describing images using 1 million captioned photographs[C]//Proceedings of the 24th International Conference on Neural Information Processing Systems.2011:1143-1151.
[24] KRISHNA R,ZHU Y,GROTH O,et al.Visual Genome:Connecting language and vision using crowdsourced dense image annotations[J].International Journal of Computer Vision,2017,123:32-73.
[25] SUHR A,ZHOU S,ZHANG A,et al.A corpus for reasoning about natural language grounded in photographs[J].arXiv:1811.00491,2018.
[26] XIE N,LAI F,DORAN D,et al.Visual entailment:A novel task for fine-grained image understanding[J].arXiv:1901.06706,2019.
[27] GOYAL Y,KHOT T,SUMMERS-STAY D,et al.Making the V in VQA matter:Elevating the role of image understanding in visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:6904-6913.
Related Articles
[1] HE Shiyang,WANG Zhaohui,GONG Shengrong,ZHONG Shan.Cross-modal Information Filtering-based Networks for Visual Question Answering[J].Computer Science,2024,51(5):85-91.
[2] WU A-ming,JIANG Pin,HAN Ya-hong.Survey of Cross-media Question Answering and Reasoning Based on Vision and Language[J].Computer Science,2021,48(3):71-78.
[3] WANG Shu-hui,YAN Xu,HUANG Qing-ming.Overview of Research on Cross-media Analysis and Reasoning Technology[J].Computer Science,2021,48(3):79-86.