Computer Science (计算机科学), 2022, Vol. 49, Issue 7: 106-112. doi: 10.11896/jsjkx.210500224
ZENG Zhi-xian (曾志贤), CAO Jian-jun (曹建军), WENG Nian-feng (翁年凤), JIANG Guo-quan (蒋国权), XU Bin (徐滨)
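Abstract: With the rapid development of mobile networks and self-media platforms, large volumes of video and text are continually produced, creating a pressing practical need for cross-modal entity resolution between video and text data. To improve the performance of video-text cross-modal entity resolution, this paper proposes a fine-grained semantic association video-text cross-modal entity resolution model based on the attention mechanism (Fine-grained Semantic Association Video-Text Cross-Modal Entity Resolution Model Based on Attention Mechanism, FSAAM). For each frame of a video, an image feature extraction network extracts feature information that serves as the frame's feature representation; a fully connected network then fine-tunes this representation and maps each frame into a common space. In parallel, the words of the text description are vectorized with word embeddings and mapped into the same common space by a bidirectional recurrent neural network. On this basis, an adaptive fine-grained video-text semantic association method is proposed. The method computes the similarity between each word of the text description and each video frame, applies an attention mechanism to form a weighted sum that gives the semantic similarity between a frame and the text, and filters out frames whose semantic similarity to the text is low, which improves model performance. FSAAM mainly addresses two problems: the semantic association of video-text cross-modal data is hard to construct because the words of a description relate to the video frames to different degrees, and videos contain redundant frames. Experiments on the MSR-VTT and VATEX datasets verify the superiority of the proposed method.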
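The word-frame association step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the similarity function, the attention formula, or the filtering rule, so the cosine similarity, the softmax temperature, the mean-score threshold, and all function names below are assumptions chosen only to make the idea concrete. Frames and words are taken to be vectors already mapped into the common space.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def softmax(values, temperature=1.0):
    """Numerically stable softmax; the temperature is a hypothetical parameter."""
    peak = max(values)
    exps = [math.exp((x - peak) / temperature) for x in values]
    total = sum(exps)
    return [e / total for e in exps]


def frame_text_scores(frames, words, temperature=0.1):
    """For each frame, attention-weighted sum of its word similarities."""
    scores = []
    for frame in frames:
        sims = [cosine(frame, word) for word in words]
        weights = softmax(sims, temperature)  # attention over the words
        scores.append(sum(w * s for w, s in zip(weights, sims)))
    return scores


def video_text_similarity(frames, words, temperature=0.1):
    """Drop low-scoring frames, then average the rest.

    Assumption: the threshold is the mean frame score; the paper's
    adaptive rule may differ.
    """
    scores = frame_text_scores(frames, words, temperature)
    # Clamp to the maximum so at least one frame always survives rounding.
    threshold = min(sum(scores) / len(scores), max(scores))
    kept = [s for s in scores if s >= threshold]
    return sum(kept) / len(kept), kept


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # toy frame embeddings
    words = [[1.0, 0.0], [0.7, 0.7]]               # toy word embeddings
    similarity, kept = video_text_similarity(frames, words)
    print(similarity, len(kept))
```

In this toy setup, frames that match none of the words receive a low score and are excluded before averaging, which mirrors the redundant-frame filtering the abstract describes.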