Computer Science, 2024, Vol. 51, Issue (7): 214-220. DOI: 10.11896/jsjkx.230600167
• Computer Graphics & Multimedia •
LI Yongjie, QIAN Yi, WEN Yimin