Computer Science ›› 2024, Vol. 51 ›› Issue (1): 168-174.doi: 10.11896/jsjkx.230700084
• Computer Graphics & Multimedia •
TANG Jia1, GUO Yan1,2, YE Mingwei1, WU Guixing1,2
[1] LU J, BATRA D, PARIKH D, et al. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks[C]//33rd Annual Conference on Neural Information Processing Systems (NeurIPS). 2019: 1-11.
[2] SU W J, ZHU X Z, CAO Y, et al. VL-BERT: pre-training of generic visual-linguistic representations[C]//8th International Conference on Learning Representations (ICLR). 2020.
[3] LI X J, YIN X, LI C Y, et al. Oscar: object-semantics aligned pre-training for vision-language tasks[C]//Computer Vision-ECCV 2020: 16th European Conference (ECCV). 2020: 121-137.
[4] YAO L, HUANG R, HOU L, et al. FILIP: fine-grained interactive language-image pre-training[C]//10th International Conference on Learning Representations (ICLR). 2022.
[5] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning (ICML). 2021: 8748-8763.
[6] LI J N, SELVARAJU R R, GOTMARE A D, et al. Align before Fuse: vision and language representation learning with momentum distillation[C]//35th Annual Conference on Neural Information Processing Systems (NeurIPS). 2021: 9694-9705.
[7] LI J N, LI D X, SAVARESE S, et al. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models[J]. arXiv:2301.12597, 2023.
[8] HUANG S H, DONG L, WANG W H, et al. Language is not all you need: aligning perception with language models[J]. arXiv:2302.14045, 2023.
[9] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//31st Annual Conference on Neural Information Processing Systems (NeurIPS). 2017: 5999-6009.
[10] BAO H B, WANG W H, DONG L, et al. VLMo: unified vision-language pre-training with mixture-of-modality-experts[C]//36th Annual Conference on Neural Information Processing Systems (NeurIPS). 2022: 32897-32912.
[11] JIA C, YANG Y F, XIA Y, et al. Scaling up visual and vision-language representation learning with noisy text supervision[C]//International Conference on Machine Learning (ICML). 2021: 4904-4916.
[12] YU J, WANG Z, VASUDEVAN V, et al. CoCa: contrastive captioners are image-text foundation models[J]. arXiv:2205.01917, 2022.
[13] SINGH A, HU R H, GOSWAMI V, et al. FLAVA: a foundational language and vision alignment model[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022: 15638-15650.
[14] TIAN Y L, KRISHNAN D, ISOLA P. Contrastive multiview coding[C]//Computer Vision-ECCV 2020: 16th European Conference (ECCV). 2020: 776-794.
[15] SHAN B, YIN W C, SUN Y, et al. ERNIE-ViL 2.0: multi-view contrastive learning for image-text pre-training[J]. arXiv:2209.15270, 2022.
[16] GAO T Y, YAO X C, CHEN D Q. SimCSE: simple contrastive learning of sentence embeddings[C]//2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2021: 6894-6910.
[17] CLARK K, LUONG M T, LE Q V, et al. ELECTRA: pre-training text encoders as discriminators rather than generators[C]//8th International Conference on Learning Representations (ICLR). 2020.
[18] ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018: 6077-6086.
[19] LIANG X N, CUI C H, WU S Z, et al. Modeling paragraph-level vision-language semantic alignment for multi-modal summarization[J]. arXiv:2208.11303, 2022.
[20] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[C]//9th International Conference on Learning Representations (ICLR). 2021.
[21] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). 2019: 4171-4186.
[22] LI X, LIU X P, LI W C, et al. Survey on contrastive learning research[J]. Journal of Chinese Computer Systems, 2023, 44(4): 787-797.
[23] TAN H, BANSAL M. LXMERT: learning cross-modality encoder representations from transformers[C]//2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 5100-5111.
[24] SELVARAJU R R, COGSWELL M, DAS A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[C]//16th IEEE International Conference on Computer Vision (ICCV). 2017: 618-626.
[25] BYUN J, HWANG T, FU J L, et al. GRIT-VLP: grouped mini-batch sampling for efficient vision and language pre-training[C]//17th European Conference on Computer Vision (ECCV). 2022: 395-412.