Computer Science, 2022, Vol. 49, Issue (9): 111-122. doi: 10.11896/jsjkx.220500130
NIE Xiu-shan1, PAN Jia-nan1, TAN Zhi-fang1, LIU Xin-fang2, GUO Jie1, YIN Yi-long2
Abstract: Natural Language Video Localization (NLVL) is a novel and challenging task. Given a text query, its goal is to locate, within an untrimmed video, the target segment whose semantics best match the query. Unlike traditional temporal action localization, NLVL is more flexible, since it is not constrained by a predefined list of actions; it is also more challenging, since it requires aligning semantic information across the two modalities of video and text. In addition, deriving the final timestamps from this cross-modal alignment is itself a difficult problem. This survey first describes the general NLVL pipeline; it then divides NLVL algorithms into supervised and weakly supervised methods according to the supervision information available, and analyzes the strengths and weaknesses of each; next, it summarizes the commonly used datasets and evaluation metrics, and evaluates and analyzes the overall performance of existing work; finally, it discusses the technical difficulties and future research trends, providing a reference for subsequent work.
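To make the evaluation protocol referenced in the abstract concrete, the sketch below computes the temporal IoU between a predicted segment and a ground-truth segment, and uses it for the "R@n, IoU=m" recall metric commonly reported in NLVL work. This is a minimal illustrative sketch, not code from any of the surveyed methods; the function names and the example numbers are assumptions made here for demonstration.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(preds_per_query, gts, n=1, iou_thresh=0.5):
    """'R@n, IoU=m': fraction of queries whose top-n predicted segments
    contain at least one segment with IoU >= iou_thresh."""
    hits = sum(
        any(temporal_iou(p, gt) >= iou_thresh for p in preds[:n])
        for preds, gt in zip(preds_per_query, gts)
    )
    return hits / len(gts)


# Hypothetical example: one query, top-1 prediction vs. ground truth.
print(temporal_iou((5.0, 12.0), (6.0, 14.0)))                            # ~0.667
print(recall_at_n([[(5.0, 12.0)]], [(6.0, 14.0)], n=1, iou_thresh=0.5))  # 1.0
```

A prediction counts as a hit once its overlap with the ground truth clears the IoU threshold, which is why looser thresholds (e.g., m = 0.3) yield higher reported recall than stricter ones (e.g., m = 0.7).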