Computer Science ›› 2022, Vol. 49 ›› Issue (9): 132-138. doi: 10.11896/jsjkx.220600022

• Computer Graphics & Multimedia •

Dual Variational Multi-modal Attention Network for Incomplete Social Event Classification

ZHOU Xu1, QIAN Sheng-sheng2, LI Zhang-ming2, FANG Quan2, XU Chang-sheng2

  1. Henan Institute of Advanced Technology, Zhengzhou University, Zhengzhou 450000, China
  2. National Key Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2022-06-02  Revised: 2022-07-05  Online: 2022-09-15  Published: 2022-09-09
  • Corresponding author: XU Chang-sheng (csxu@nlpr.ia.ac.cn)
  • About author: ZHOU Xu, born in 1997, postgraduate (yinranzhou@foxmail.com). His main research interests include natural language processing and multimedia computing analysis.
    XU Chang-sheng, born in 1969, Ph.D, professor. His main research interests include computer vision and multimedia computing analysis.
  • Supported by: National Natural Science Foundation of China (61936005).

Abstract: The rapid development of the Internet and the continuous expansion of social media have brought a wealth of social event information, and the task of social event classification has become increasingly challenging. Making full use of image-level and text-level information is the key to social event classification. However, most existing methods have the following limitations: 1) most existing multi-modal methods rest on the ideal assumption that the samples of every modality are sufficient and complete, but in real applications this assumption does not always hold, and a given modality of an event may be missing; 2) most methods simply concatenate the image features and text features of a social event to obtain multi-modal features for classification, ignoring the semantic gap between modalities. To address these challenges, this paper proposes a dual variational multi-modal attention network (DVMAN) that handles both complete and incomplete social event classification. In DVMAN, a novel dual variational autoencoder network generates common representations of social events and further reconstructs the modal information missing in incomplete social event learning. Through distribution alignment and cross-reconstruction alignment, the image and text latent representations are doubly aligned to narrow the gap between the modalities, and the latent representations of missing modalities are synthesized by a generative model. In addition, a multi-modal fusion module is designed to integrate the fine-grained image and text information of social events, so that the modalities complement and reinforce each other. Extensive experiments on two public event datasets show that DVMAN improves accuracy by more than 4% over existing state-of-the-art methods, demonstrating its superior performance for social event classification.

Key words: Multi-modal, Social event classification, Social media, Incomplete data learning
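The paper itself is not reproduced here, so the following is only an illustrative sketch of the two alignment ideas named in the abstract (distribution alignment and cross-reconstruction alignment between image and text variational autoencoders). All weights, dimensions, and function names below are hypothetical; the sketch uses untrained linear encoders/decoders in NumPy purely to show how the loss terms could be composed and how a missing modality's latent representation could be synthesized from the other modality.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w_mu, w_logvar):
    """Map a modality feature to the parameters of a Gaussian latent."""
    return x @ w_mu, x @ w_logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps (the VAE reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_between(mu1, logvar1, mu2, logvar2):
    """KL(N(mu1, var1) || N(mu2, var2)), summed over dimensions; used here
    as the distribution-alignment term between image and text latents."""
    var1, var2 = np.exp(logvar1), np.exp(logvar2)
    return 0.5 * np.sum(logvar2 - logvar1 + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

# Toy dimensions: 8-d image features, 6-d text features, 4-d shared latent.
d_img, d_txt, d_z = 8, 6, 4
x_img = rng.standard_normal((1, d_img))
x_txt = rng.standard_normal((1, d_txt))

# Randomly initialized (untrained) encoder and decoder weights.
Wi_mu, Wi_lv = rng.standard_normal((d_img, d_z)), rng.standard_normal((d_img, d_z)) * 0.1
Wt_mu, Wt_lv = rng.standard_normal((d_txt, d_z)), rng.standard_normal((d_txt, d_z)) * 0.1
Di = rng.standard_normal((d_z, d_img))  # decoder back to image space
Dt = rng.standard_normal((d_z, d_txt))  # decoder back to text space

mu_i, lv_i = encode(x_img, Wi_mu, Wi_lv)
mu_t, lv_t = encode(x_txt, Wt_mu, Wt_lv)
z_i, z_t = reparameterize(mu_i, lv_i), reparameterize(mu_t, lv_t)

# Self-reconstruction plus cross-reconstruction alignment: each latent must
# also decode the *other* modality, tying the two latent spaces together.
recon = (np.mean((z_i @ Di - x_img) ** 2) + np.mean((z_t @ Dt - x_txt) ** 2)
         + np.mean((z_t @ Di - x_img) ** 2) + np.mean((z_i @ Dt - x_txt) ** 2))
align = kl_between(mu_i, lv_i, mu_t, lv_t)  # distribution alignment
loss = recon + align

# If the text modality is missing, its latent can be synthesized from the
# image encoder and decoded into text space.
z_missing_txt = reparameterize(mu_i, lv_i) @ Dt
print(loss, z_missing_txt.shape)
```

In a trained model the decoders would of course be nonlinear networks and the loss would also include the usual prior KL term and a classification loss; the point of the sketch is only the pairing of cross-reconstruction with distribution alignment that the abstract describes.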

CLC number: TP391