Computer Science (计算机科学), 2020, Vol. 47, Issue (7): 125-129. doi: 10.11896/jsjkx.190700006
ZHANG Heng (张衡)1, MA Ming-dong (马明栋)2, WANG De-yu (王得玉)2
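Abstract: Jointly understanding video content and text semantics has been widely studied in many fields. Early work mainly mapped text and video into a common vector space, but this approach is held back by the shortage of large-scale text-video datasets. Because video data is highly redundant, extracting features for an entire video directly with a 3D network requires many parameters and gives poor real-time performance, which makes it ill-suited to video tasks. To address these problems, this paper aggregates local video features with a clustering network and can train the model on image and video data at the same time, which effectively handles missing video modalities. It also compares the effect of the face modality on the retrieval task. An attention mechanism is added to the clustering network so that the network focuses on the modalities most strongly related to the text semantics, which raises the text-video similarity score and helps improve model accuracy. The model is evaluated on the text-video retrieval (recall) task on the MPII and MSR-VTT datasets, and its accuracy improves over the other models compared on both datasets. The experimental results show that text-video feature learning based on the clustering network maps text and video into a common vector space well, so that texts and videos with similar semantics lie close together while dissimilar ones lie far apart, and the learned space can therefore be used for text-video retrieval.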
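As an illustration of how retrieval in a common vector space is typically scored, the following is a minimal sketch, not the authors' implementation. It assumes cosine similarity as the text-video similarity and Recall@K as the metric; the abstract does not specify either. The function names `cosine` and `recall_at_k` and the toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_k(text_embs, video_embs, k):
    """Fraction of text queries whose paired video (same index) ranks in the top k."""
    hits = 0
    for i, text in enumerate(text_embs):
        scores = [cosine(text, video) for video in video_embs]
        # Rank video indices by descending similarity to the text query.
        ranked = sorted(range(len(video_embs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)

# Toy embeddings (hypothetical values): text i is paired with video i.
texts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
videos = [[0.9, 0.1], [0.7, 0.7], [0.6, 0.8]]

print(recall_at_k(texts, videos, 1))
print(recall_at_k(texts, videos, 2))
```

In this toy setup only the first text ranks its paired video first, so Recall@1 is 1/3, while every paired video appears within the top two, so Recall@2 is 1.0. A well-trained embedding would push paired texts and videos closer together and raise Recall@1.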