Computer Science ›› 2020, Vol. 47 ›› Issue (7): 125-129. doi: 10.11896/jsjkx.190700006

• Computer Graphics & Multimedia •

Text-Video Feature Learning Based on Clustering Network

ZHANG Heng1, MA Ming-dong2, WANG De-yu2

  1. College of Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  2. College of Geographical and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  • Received: 2019-06-30 Online: 2020-07-15 Published: 2020-07-16
  • Corresponding author: MA Ming-dong (mmdbs@126.com)
  • Author contact: 1217012230@njupt.edu.cn
  • About author: ZHANG Heng, born in 1994, master, is a member of China Computer Federation. His main research interests include image processing and deep learning.
    MA Ming-dong, born in 1964, Ph.D, professor, master supervisor. His main research interests include GIS platform software design and development.
  • Supported by:
    This work was supported by the Youth Fund of Jiangsu Natural Science Foundation (BK20140868).

Abstract: Joint understanding of video content and text semantics has been widely studied in many fields. Early work mainly maps text and video into a common vector space, but this approach suffers from a shortage of large-scale text-video datasets. Moreover, because video data is highly redundant, extracting features for a whole video directly with a 3D network requires many parameters and gives poor real-time performance, which hinders video tasks. To address these problems, this paper aggregates local video features with a clustering network and trains the model on image and video data at the same time, which effectively handles missing video modalities. The influence of the face modality on the recall task is also compared. An attention mechanism is added to the clustering network so that the network pays more attention to the modalities that are strongly related to the text semantics, which raises the text-video similarity score and improves model accuracy. Experimental results show that text-video feature learning based on the clustering network maps text and video into a common vector space in which text and video with similar semantics are close together and those with dissimilar semantics are far apart. The model is evaluated on the text-video recall (retrieval) task on the MPII and MSR-VTT datasets, and its accuracy improves on both datasets compared with other models, so the learned embedding can be used for text-video recall.

Key words: Clustering network, Modal fusion, Recall model, Video understanding
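
The pipeline summarized in the abstract (cluster-based aggregation of local features, attention over modalities, similarity in a common space, recall evaluation) can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a NetVLAD-style soft-assignment pooling layer, a softmax attention over modalities that masks out missing ones, and Recall@K computed from a text-by-video similarity matrix. All function names, shapes, and parameters below are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries receive zero weight.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def cluster_aggregate(local_feats, centers, assign_w, assign_b):
    """NetVLAD-style pooling (assumed): (N, D) local features -> (K*D,) vector."""
    # Soft-assign each local descriptor to the K clusters.
    assign = softmax(local_feats @ assign_w + assign_b, axis=1)   # (N, K)
    # Accumulate assignment-weighted residuals to each cluster center.
    residuals = local_feats[:, None, :] - centers[None, :, :]     # (N, K, D)
    vlad = (assign[:, :, None] * residuals).sum(axis=0)           # (K, D)
    vlad = l2_normalize(vlad, axis=1)                             # per-cluster norm
    return l2_normalize(vlad.reshape(-1), axis=0)                 # global norm


def fused_similarity(text_embs, video_embs, attn_logits, available):
    """Attention-weighted sum of per-modality cosine similarities.

    text_embs, video_embs: (M, D); attn_logits: (M,); available: (M,) bool.
    At least one modality must be available.
    """
    # Missing modalities get -inf logits, so their attention weight is 0.
    weights = softmax(np.where(available, attn_logits, -np.inf), axis=0)
    sims = (l2_normalize(text_embs) * l2_normalize(video_embs)).sum(axis=1)
    sims = np.where(available, sims, 0.0)
    return float((weights * sims).sum())


def recall_at_k(sim, k):
    """sim[i, j]: similarity of text i and video j; pair (i, i) is the match."""
    correct = np.diag(sim)[:, None]
    # Rank of the correct video = number of videos scored strictly higher.
    ranks = (sim > correct).sum(axis=1)
    return float((ranks < k).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_local, dim, n_clusters = 20, 8, 4
    pooled = cluster_aggregate(
        rng.normal(size=(n_local, dim)),        # e.g. per-frame descriptors
        rng.normal(size=(n_clusters, dim)),     # cluster centers (learned in practice)
        rng.normal(size=(dim, n_clusters)),     # assignment weights
        np.zeros(n_clusters),                   # assignment bias
    )
    print(pooled.shape)
```

In this sketch a video of any length is reduced to a fixed-size vector, and a modality that is absent (for example, an image-only sample with no motion or audio stream) simply drops out of the similarity, which is one plausible way to train on mixed image and video data.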

CLC Number:

  • TP391