Computer Science ›› 2022, Vol. 49 ›› Issue (11A): 211000161-7. DOI: 10.11896/jsjkx.211000161

• Image Processing & Multimedia Technology •

  • Corresponding author: WANG Zhong-qing (wangzq@suda.edu.cn)
  • About author: (20205227103@stu.suda.edu.cn)

Noise Event Classification Model Based on Multimodal Attention

WU He-xiang, WANG Zhong-qing, LI Pei-feng   

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Online: 2022-11-10 Published: 2022-11-21
  • About author: WU He-xiang, born in 1998, postgraduate. His main research interests include natural language processing.
    WANG Zhong-qing, born in 1987, Ph.D, lecturer, is a member of China Computer Federation. His main research interest is natural language processing.
  • Supported by:
    National Natural Science Foundation of China (61806137, 61702518, 61836007), General Program of Natural Science Research of Jiangsu Higher Education Institutions (18KJB520043) and a project funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.



Abstract: Social media is nowadays one of the main channels through which people obtain news and learn about real-time events, owing to its low cost, easy access and rapid dissemination. It provides multiple modalities of information, including text and images, for analyzing specific events, but this information also contains many irrelevant events and much false information. To this end, this paper combines text-image pairs to determine whether the text and image provide information related to a specific event, so as to filter out irrelevant noise events at the sentence level. Motivated by the observation that the description in a text is often associated with the scene in the corresponding image, this paper proposes an attention-based method that combines text and image information to classify events. The method effectively attends to the important information in both text and images and promotes information interaction between modalities. Experimental results on the CrisisMMD dataset show that the proposed model outperforms six strong baselines and can effectively fuse features from different modalities to obtain a superior joint representation.
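The core mechanism the abstract describes, attention that lets one modality's features select the relevant parts of the other modality before fusion, can be illustrated with a minimal pure-Python sketch. This is only an illustrative toy (plain dot-product attention over hand-made 2-d vectors and fusion by concatenation), not the paper's actual architecture or feature extractors:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_modal_attention(query, keys, values):
    # One modality's query vector attends over the other modality's
    # key/value vectors; returns the attention-weighted context vector.
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Toy example: a 2-d text feature attends over three image-region features,
# and the resulting context is concatenated with the text feature to form
# a joint representation that a classifier could consume.
text_feature = [1.0, 0.0]
image_regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context = cross_modal_attention(text_feature, image_regions, image_regions)
joint = text_feature + context  # simple fusion by concatenation
```

In the same spirit, the image side can query the text tokens, and the two attended contexts are fused into the joint representation used for the relevant/irrelevant decision.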

Key words: Attention mechanism, Multimodal fusion, Noise event classification

CLC Number: TP391