Computer Science ›› 2022, Vol. 49 ›› Issue (7): 106-112. doi: 10.11896/jsjkx.210500224

• Computer Graphics & Multimedia •

Fine-grained Semantic Association Video-Text Cross-modal Entity Resolution Based on Attention Mechanism

ZENG Zhi-xian, CAO Jian-jun, WENG Nian-feng, JIANG Guo-quan, XU Bin   

  1. Sixty-third Research Institute, National University of Defense Technology, Nanjing 210007, China
  • Received: 2021-05-31 Revised: 2021-12-15 Online: 2022-07-15 Published: 2022-07-12
  • About author: ZENG Zhi-xian, born in 1996, postgraduate, is a member of China Computer Federation. His main research interests include data quality control and data governance.
    CAO Jian-jun, born in 1975, Ph.D, associate researcher, master's supervisor, is a senior member of China Computer Federation. His main research interests include data quality control, data governance, data intelligence analysis and application.
  • Supported by:
    National Natural Science Foundation of China (61371196) and China Postdoctoral Science Foundation (2015M582832).

Abstract: With the rapid development of mobile networks and we-media platforms, large amounts of video and text information are being generated, creating an urgent demand for video-text cross-modal entity resolution. To improve the performance of video-text cross-modal entity resolution, a novel fine-grained semantic association video-text cross-modal entity resolution model based on the attention mechanism (FSAAM) is proposed. For each frame in a video, features are extracted by an image feature extraction network, then fine-tuned by a fully connected network and mapped to a common space. At the same time, the words in the text description are vectorized by word embedding and mapped to the same common space by a bi-directional recurrent neural network. On this basis, an adaptive fine-grained video-text semantic association method is proposed to calculate the similarity between each word in the text and each frame in the video. The attention mechanism is used to compute a weighted sum of these similarities, giving the semantic similarity between each frame and the text description, and frames with low semantic similarity to the text are filtered out to improve the model's performance. FSAAM mainly addresses two problems: videos contain a large amount of redundant information and texts contain many words that contribute little, and the varying degree of association between words and frames makes it difficult to construct video-text semantic associations. Experiments on the MSR-VTT and VATEX datasets demonstrate the superiority of the proposed method.
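The matching step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the formulas, so the cosine similarity, the softmax temperature, the `keep_ratio` filtering rule, and the final mean over retained frames are all assumptions made for illustration, and the function names are hypothetical. Frame and word vectors are assumed to already lie in the common space.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def video_text_similarity(frames, words, temperature=0.1, keep_ratio=0.5):
    """Illustrative fine-grained video-text similarity.

    frames: (n_frames, d) frame embeddings in the common space.
    words:  (n_words, d) word embeddings in the common space.
    temperature, keep_ratio: assumed hyperparameters, not from the paper.
    Returns (video-level score, per-frame scores, indices of kept frames).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Word-frame cosine similarity matrix, shape (n_frames, n_words).
    sim = l2_normalize(frames) @ l2_normalize(words).T

    # For each frame, attention weights over the words.
    attn = softmax(sim / temperature, axis=1)

    # Attention-weighted sum gives one frame-to-text similarity per frame.
    frame_scores = (attn * sim).sum(axis=1)

    # Filter: keep only the frames most similar to the text.
    n_keep = max(1, int(np.ceil(keep_ratio * len(frame_scores))))
    kept = np.sort(np.argsort(frame_scores)[-n_keep:])

    # Aggregate the retained frames into a single video-text score.
    return float(frame_scores[kept].mean()), frame_scores, kept


# Usage with random stand-in embeddings (8 frames, 5 words, 16 dimensions).
rng = np.random.default_rng(0)
score, per_frame, kept_idx = video_text_similarity(
    rng.normal(size=(8, 16)), rng.normal(size=(5, 16))
)
print(score, kept_idx)
```

In this sketch the softmax lets strongly matching words dominate each frame's score, and the top-k filter discards frames that match the text poorly; a trained model would learn the embeddings and could use a different filtering criterion.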

Key words: Attention mechanism, Common space, Cross-modal entity resolution, Feature extraction, Fine granularity, Semantic similarity

CLC Number: 

  • TP311