Computer Science, 2023, Vol. 50, Issue (8): 99-110. doi: 10.11896/jsjkx.230200091
• Computer Graphics & Multimedia •
ZHOU Ziyi, XIONG Hailing
[1] ZHANG Yian, YANG Ying, REN Gang, WANG Gang. Study on Multimodal Online Reviews Helpfulness Prediction Based on Attention Mechanism [J]. Computer Science, 2023, 50(8): 37-44.
[2] SONG Xinyang, YAN Zhiyuan, SUN Muyi, DAI Linlin, LI Qi, SUN Zhenan. Review of Talking Face Generation [J]. Computer Science, 2023, 50(8): 68-78.
[3] WANG Xu, WU Yanxia, ZHANG Xue, HONG Ruize, LI Guangsheng. Survey of Rotating Object Detection Research in Computer Vision [J]. Computer Science, 2023, 50(8): 79-92.
[4] ZHANG Xiao, DONG Hongbin. Lightweight Multi-view Stereo Integrating Coarse Cost Volume and Bilateral Grid [J]. Computer Science, 2023, 50(8): 125-132.
[5] WANG Yu, WANG Zuchao, PAN Rui. Survey of DGA Domain Name Detection Based on Character Feature [J]. Computer Science, 2023, 50(8): 251-259.
[6] WANG Mingxia, XIONG Yun. Disease Diagnosis Prediction Algorithm Based on Contrastive Learning [J]. Computer Science, 2023, 50(7): 46-52.
[7] SHEN Zhehui, WANG Kailai, KONG Xiangjie. Exploring Station Spatio-Temporal Mobility Pattern: A Short and Long-term Traffic Prediction Framework [J]. Computer Science, 2023, 50(7): 98-106.
[8] HUO Weile, JING Tao, REN Shuang. Review of 3D Object Detection for Autonomous Driving [J]. Computer Science, 2023, 50(7): 107-118.
[9] ZHOU Bo, JIANG Peifeng, DUAN Chang, LUO Yuetong. Study on Single Background Object Detection Oriented Improved-RetinaNet Model and Its Application [J]. Computer Science, 2023, 50(7): 137-142.
[10] MAO Huihui, ZHAO Xiaole, DU Shengdong, TENG Fei, LI Tianrui. Short-term Subway Passenger Flow Forecasting Based on Graphical Embedding of Temporal Knowledge [J]. Computer Science, 2023, 50(7): 213-220.
[11] LI Yuqiang, LI Linfeng, ZHU Hao, HOU Mengshu. Deep Learning-based Algorithm for Active IPv6 Address Prediction [J]. Computer Science, 2023, 50(7): 261-269.
[12] LI Kun, GUO Wei, ZHANG Fan, DU Jiayu, YANG Meiyue. Adversarial Malware Generation Method Based on Genetic Algorithm [J]. Computer Science, 2023, 50(7): 325-331.
[13] LIANG Mingxuan, WANG Shi, ZHU Junwu, LI Yang, GAO Xiang, JIAO Zhixiang. Survey of Knowledge-enhanced Natural Language Generation Research [J]. Computer Science, 2023, 50(6A): 220200120-8.
[14] WANG Dongli, YANG Shan, OUYANG Wanli, LI Baopu, ZHOU Yan. Explainability of Artificial Intelligence: Development and Application [J]. Computer Science, 2023, 50(6A): 220600212-7.
[15] GAO Xiang, WANG Shi, ZHU Junwu, LIANG Mingxuan, LI Yang, JIAO Zhixiang. Overview of Named Entity Recognition Tasks [J]. Computer Science, 2023, 50(6A): 220200119-8.