Computer Science ›› 2022, Vol. 49 ›› Issue (9): 111-122.doi: 10.11896/jsjkx.220500130
NIE Xiu-shan¹, PAN Jia-nan¹, TAN Zhi-fang¹, LIU Xin-fang², GUO Jie¹, YIN Yi-long²