Computer Science ›› 2024, Vol. 51 ›› Issue (1): 168-174.doi: 10.11896/jsjkx.230700084
• Computer Graphics & Multimedia •
TANG Jia1, GUO Yan1,2, YE Mingwei1, WU Guixing1,2
[1] LU J, BATRA D, PARIKH D, et al. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks[C]//33rd Annual Conference on Neural Information Processing Systems. 2019: 1-11.
[2] SU W J, ZHU X Z, CAO Y, et al. VL-BERT: pre-training of generic visual-linguistic representations[C]//8th International Conference on Learning Representations (ICLR). 2020.
[3] LI X J, YIN X, LI C Y, et al. Oscar: object-semantics aligned pre-training for vision-language tasks[C]//Computer Vision - ECCV 2020: 16th European Conference (ECCV). 2020: 121-137.
[4] YAO L, HUANG R, HOU L, et al. FILIP: fine-grained interactive language-image pre-training[C]//10th International Conference on Learning Representations (ICLR). 2022.
[5] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning (ICML). 2021: 8748-8763.
[6] LI J N, SELVARAJU R R, GOTMARE A D, et al. Align before Fuse: vision and language representation learning with momentum distillation[C]//35th Annual Conference on Neural Information Processing Systems (NeurIPS). 2021: 9694-9705.
[7] LI J N, LI D X, SAVARESE S, et al. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models[J]. arXiv:2301.12597, 2023.
[8] HUANG S H, DONG L, WANG W H, et al. Language is not all you need: aligning perception with language models[J]. arXiv:2302.14045, 2023.
[9] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//31st Annual Conference on Neural Information Processing Systems (NeurIPS). 2017: 5999-6009.
[10] BAO H B, WANG W H, DONG L, et al. VLMo: unified vision-language pre-training with mixture-of-modality-experts[C]//36th Annual Conference on Neural Information Processing Systems (NeurIPS). 2022: 32897-32912.
[11] JIA C, YANG Y F, XIA Y, et al. Scaling up visual and vision-language representation learning with noisy text supervision[C]//International Conference on Machine Learning (ICML). 2021: 4904-4916.
[12] YU J, WANG Z, VASUDEVAN V, et al. CoCa: contrastive captioners are image-text foundation models[J]. arXiv:2205.01917, 2022.
[13] SINGH A, HU R H, GOSWAMI V, et al. FLAVA: a foundational language and vision alignment model[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022: 15638-15650.
[14] TIAN Y L, KRISHNAN D, ISOLA P. Contrastive multiview coding[C]//Computer Vision - ECCV 2020: 16th European Conference (ECCV). 2020: 776-794.
[15] SHAN B, YIN W C, SUN Y, et al. ERNIE-ViL 2.0: multi-view contrastive learning for image-text pre-training[J]. arXiv:2209.15270, 2022.
[16] GAO T Y, YAO X C, CHEN D Q. SimCSE: simple contrastive learning of sentence embeddings[C]//2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2021: 6894-6910.
[17] CLARK K, LUONG M T, LE Q V, et al. ELECTRA: pre-training text encoders as discriminators rather than generators[C]//8th International Conference on Learning Representations (ICLR). 2020.
[18] ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018: 6077-6086.
[19] LIANG X N, CUI C H, WU S Z, et al. Modeling paragraph-level vision-language semantic alignment for multi-modal summarization[J]. arXiv:2208.11303, 2022.
[20] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[C]//9th International Conference on Learning Representations (ICLR). 2021.
[21] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). 2019: 4171-4186.
[22] LI X, LIU X P, LI W C, et al. Survey on contrastive learning research[J]. Journal of Chinese Computer Systems, 2023, 44(4): 787-797.
[23] TAN H, BANSAL M. LXMERT: learning cross-modality encoder representations from transformers[C]//2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 5100-5111.
[24] SELVARAJU R R, COGSWELL M, DAS A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[C]//16th IEEE International Conference on Computer Vision (ICCV). 2017: 618-626.
[25] BYUN J, HWANG T, FU J L, et al. GRIT-VLP: grouped mini-batch sampling for efficient vision and language pre-training[C]//17th European Conference on Computer Vision (ECCV). 2022: 395-412.
[1] WU Jiawei, FANG Quan, HU Jun, QIAN Shengsheng. Pre-training of Heterogeneous Graph Neural Networks for Multi-label Document Classification[J]. Computer Science, 2024, 51(1): 143-149.
[2] YI Liu, GENG Xinyu, BAI Jing. Hierarchical Multi-label Text Classification Algorithm Based on Parallel Convolutional Network Information Fusion[J]. Computer Science, 2023, 50(9): 278-286.
[3] ZHANG Yian, YANG Ying, REN Gang, WANG Gang. Study on Multimodal Online Reviews Helpfulness Prediction Based on Attention Mechanism[J]. Computer Science, 2023, 50(8): 37-44.
[4] SONG Xinyang, YAN Zhiyuan, SUN Muyi, DAI Linlin, LI Qi, SUN Zhenan. Review of Talking Face Generation[J]. Computer Science, 2023, 50(8): 68-78.
[5] ZHANG Xiao, DONG Hongbin. Lightweight Multi-view Stereo Integrating Coarse Cost Volume and Bilateral Grid[J]. Computer Science, 2023, 50(8): 125-132.
[6] ZHOU Fengfan, LING Hefei, ZHANG Jinyuan, XIA Ziwei, SHI Yuxuan, LI Ping. Facial Physical Adversarial Example Performance Prediction Algorithm Based on Multi-modal Feature Fusion[J]. Computer Science, 2023, 50(8): 280-285.
[7] CAI Haoran, YANG Jian, YANG Lin, LIU Cong. Low-resource Thai Speech Synthesis Based on Alternate Training and Pre-training[J]. Computer Science, 2023, 50(6A): 220800127-5.
[8] QIN Jing, WANG Weibin, ZOU Qijie, WANG Zumin, JI Changqing. Review of 3D Target Detection Methods Based on LiDAR Point Clouds[J]. Computer Science, 2023, 50(6A): 220400214-7.
[9] ZHANG Renbin, ZUO Yicong, ZHOU Zelin, WANG Long, CUI Yuhang. Multimodal Generative Adversarial Networks Based Multivariate Time Series Anomaly Detection[J]. Computer Science, 2023, 50(5): 355-362.
[10] WANG Taiyan, PAN Zulie, YU Lu, SONG Jingbin. Binary Code Similarity Detection Method Based on Pre-training Assembly Instruction Representation[J]. Computer Science, 2023, 50(4): 288-297.
[11] LIU Zhe, YIN Chengfeng, LI Tianrui. Chinese Spelling Check Based on BERT and Multi-feature Fusion Embedding[J]. Computer Science, 2023, 50(3): 282-290.
[12] CHEN Zhen, PU Yuanyuan, ZHAO Zhengpeng, XU Dan, QIAN Wenhua. Multimodal Sentiment Analysis Based on Adaptive Gated Information Fusion[J]. Computer Science, 2023, 50(3): 298-306.
[13] SU Qi, WANG Hongling, WANG Zhongqing. Unsupervised Script Summarization Based on Pre-trained Model[J]. Computer Science, 2023, 50(2): 310-316.
[14] FAN Dongxu, GUO Yi. Aspect-based Multimodal Sentiment Analysis Based on Trusted Fine-grained Alignment[J]. Computer Science, 2023, 50(12): 246-254.
[15] LI Xiaopeng, LING Cheng, GAO Jingyang. Mixed Path HMC Sampling Methods for Molecular Tree Spaces[J]. Computer Science, 2023, 50(12): 322-329.