Computer Science ›› 2020, Vol. 47 ›› Issue (7): 125-129. doi: 10.11896/jsjkx.190700006

• Computer Graphics & Multimedia •

Text-Video Feature Learning Based on Clustering Network

ZHANG Heng1, MA Ming-dong2, WANG De-yu2   

  1 College of Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  2 College of Geographical and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  • Received: 2019-06-30  Online: 2020-07-15  Published: 2020-07-16
  • About author: ZHANG Heng, born in 1994, master, is a member of China Computer Federation. His main research interests include image processing and deep learning.
    MA Ming-dong, born in 1964, Ph.D, professor, master supervisor. His main research interests include GIS platform software design and development.
  • Supported by:
    This work was supported by the Youth Fund of the Jiangsu Natural Science Foundation (BK20140868).

Abstract: Jointly understanding video content and text semantics has been widely studied in many fields. Early research mainly mapped text and video into a common vector space, but one problem this approach faces is the lack of large-scale text-video datasets. Moreover, because video data carry heavy information redundancy, extracting features for a whole video directly through a 3D network leads to many network parameters and poor real-time performance, which is unfavorable for video tasks. To solve these problems, this paper proposes aggregating the local features of a video with a clustering network and training the network model on image and video datasets simultaneously, which effectively alleviates the problem of missing video modalities. The influence of the face modality on the recall task is also compared. An attention mechanism is added to the clustering network so that the network attends to the modalities strongly related to the text semantics, thereby raising the text-video similarity score and improving the accuracy of the model. Experimental results show that text-video feature learning based on a clustering network can map text and video into a common vector space in which semantically similar text-video pairs lie close together and semantically dissimilar pairs lie far apart. Evaluated on the text-video recall task over the MPII and MSR-VTT datasets, the proposed model outperforms other models. These results fully demonstrate that text-video feature learning based on a clustering network can map text and video into a common vector space and can be applied to the text-video recall task.
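For concreteness, the following is a minimal sketch, in PyTorch, of a NetVLAD-style aggregation layer, the kind of clustering network the abstract describes for pooling local video features into one fixed-length descriptor. The cluster count, feature dimension, and initialization are illustrative assumptions, not values taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Aggregates a variable number of local descriptors (e.g. per-frame
    CNN features) into one fixed-length video-level feature by soft
    assignment to learned cluster centroids."""
    def __init__(self, num_clusters: int = 32, dim: int = 512):
        super().__init__()
        self.assign = nn.Linear(dim, num_clusters)                 # soft-assignment scores
        self.centroids = nn.Parameter(0.01 * torch.randn(num_clusters, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_descriptors, dim)
        soft_assign = F.softmax(self.assign(x), dim=-1)            # (B, N, K)
        # Residual of every descriptor to every centroid, via broadcasting.
        residual = x.unsqueeze(2) - self.centroids                 # (B, N, K, D)
        vlad = (soft_assign.unsqueeze(-1) * residual).sum(dim=1)   # (B, K, D)
        vlad = F.normalize(vlad, dim=-1)                           # intra-normalization per cluster
        return F.normalize(vlad.flatten(1), dim=-1)                # (B, K*D), L2-normalized

# Usage: pool 30 hypothetical per-frame features per video into one vector.
video_local = torch.randn(8, 30, 512)
video_feat = NetVLAD()(video_local)                                # (8, 32*512)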
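Likewise, a hypothetical sketch of the attention mechanism over modality embeddings (e.g. appearance, motion, audio, face) and of the text-video similarity ranked in the recall evaluation. The gating form, modality count, and dimensions are assumptions made for illustration, not the paper's exact formulation.

class AttentiveFusion(nn.Module):
    """Text-conditioned attention weights over per-modality video embeddings,
    so modalities strongly related to the text semantics dominate the fusion."""
    def __init__(self, num_modalities: int = 4, dim: int = 512):
        super().__init__()
        self.score = nn.Linear(dim, num_modalities)

    def forward(self, text_emb: torch.Tensor, modality_embs: torch.Tensor):
        # text_emb: (B, dim); modality_embs: (B, M, dim)
        weights = F.softmax(self.score(text_emb), dim=-1)          # (B, M)
        video_emb = (weights.unsqueeze(-1) * modality_embs).sum(1) # (B, dim)
        return video_emb, weights

def text_video_similarity(text_emb: torch.Tensor, video_emb: torch.Tensor):
    # Cosine similarities between all text/video pairs in a batch; ranking
    # each text's row of this matrix yields the Recall@K retrieval metrics.
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    return t @ v.t()                                               # (B_text, B_video)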

Key words: Clustering network, Modal fusion, Recall model, Video understanding

CLC Number: TP391