Computer Science (计算机科学), 2022, Vol. 49, Issue (7): 106-112. doi: 10.11896/jsjkx.210500224

• Computer Graphics & Multimedia •

基于注意力机制的细粒度语义关联视频-文本跨模态实体分辨

曾志贤, 曹建军, 翁年凤, 蒋国权, 徐滨   

  • Corresponding author: CAO Jian-jun (caojj@nudt.edu.cn)
  • About author: (2604533953@qq.com)

Fine-grained Semantic Association Video-Text Cross-modal Entity Resolution Based on Attention Mechanism

ZENG Zhi-xian, CAO Jian-jun, WENG Nian-feng, JIANG Guo-quan, XU Bin   

  1. Sixty-third Research Institute, National University of Defense Technology, Nanjing 210007, China
  • Received: 2021-05-31 Revised: 2021-12-15 Online: 2022-07-15 Published: 2022-07-12
  • About author: ZENG Zhi-xian, born in 1996, postgraduate, is a member of China Computer Federation. His main research interests include data quality control and data governance.
    CAO Jian-jun, born in 1975, Ph.D, associate researcher, master supervisor, is a senior member of China Computer Federation. His main research interests include data quality control, data governance, data intelligence analysis and application.
  • Supported by:
    National Natural Science Foundation of China (61371196) and China Postdoctoral Science Foundation (2015M582832).

Abstract: With the rapid development of mobile networks and we-media platforms, large amounts of video and text information are continually generated, which creates an urgent practical demand for video-text cross-modal entity resolution. To improve the performance of video-text cross-modal entity resolution, a fine-grained semantic association video-text cross-modal entity resolution model based on attention mechanism (FSAAM) is proposed. For each frame in a video, feature information is extracted by an image feature extraction network and used as the feature representation, which is then fine-tuned by a fully connected network and mapped to a common space. At the same time, the words in the text description are vectorized by word embedding and mapped to the common space by a bi-directional recurrent neural network. On this basis, an adaptive fine-grained video-text semantic association method is proposed. It calculates the similarity between each word in the text description and the video frames, uses the attention mechanism for weighted summation to obtain the semantic similarity between each video frame and the text, and filters out frames with low semantic similarity to the text, which improves the model's performance. FSAAM mainly addresses three problems: videos contain a large amount of redundant information (redundant frames), text descriptions contain many words that contribute little, and video-text semantic association is difficult to construct because words and frames are associated to different degrees. Experiments on the MSR-VTT and VATEX datasets demonstrate the superiority of the proposed method.

Key words: Attention mechanism, Common space, Cross-modal entity resolution, Feature extraction, Fine granularity, Semantic similarity

CLC Number: 

  • TP311
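
The word-frame association step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that words and frames have already been mapped to the common space, and the abstract does not specify the similarity measure, the form of the attention weights, or the filtering rule. Cosine similarity, a softmax over words with a hypothetical `temperature`, a hypothetical `keep_ratio` for frame filtering, and mean aggregation over the kept frames are all assumptions made here for illustration, as are the function names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(xs, temperature=1.0):
    """Numerically stable softmax; the temperature is a hypothetical knob."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def frame_text_similarities(frames, words, temperature=0.1):
    """For each frame, attention-weighted sum of its word-frame similarities."""
    scores = []
    for frame in frames:
        sims = [cosine(word, frame) for word in words]
        # Assumption: attention weights are a softmax over the words.
        weights = softmax(sims, temperature)
        scores.append(sum(a * s for a, s in zip(weights, sims)))
    return scores


def video_text_similarity(frames, words, keep_ratio=0.5, temperature=0.1):
    """Drop the frames least similar to the text, then average the rest."""
    scores = frame_text_similarities(frames, words, temperature)
    # Assumption: keep a fixed fraction of frames (at least one).
    k = max(1, math.ceil(keep_ratio * len(scores)))
    kept = sorted(scores, reverse=True)[:k]
    return sum(kept) / len(kept)
```

For example, with `frames = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]` and `words = [[1.0, 0.0], [0.6, 0.8]]`, the third frame points away from both words, so it receives the lowest frame-text score and is filtered out when `keep_ratio=0.5`.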