Computer Science ›› 2024, Vol. 51 ›› Issue (9): 258-264. doi: 10.11896/jsjkx.230700163

• Artificial Intelligence •

Image-Text Sentiment Classification Model Based on Multi-scale Cross-modal Feature Fusion

LIU Qian1, BAI Zhihao1, CHENG Chunling1, GUI Yaocheng2   

  1. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
    2. School of Modern Posts, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  • Received: 2023-07-21 Revised: 2023-12-28 Online: 2024-09-15 Published: 2024-09-10
  • About author: LIU Qian, born in 1986, Ph.D, lecturer, is a member of CCF (No.98989M). Her main research interests include artificial intelligence and sentiment analysis.
    CHENG Chunling, born in 1972, professor, is a member of CCF (No.E200015597M). Her main research interests include data mining and data management.
  • Supported by:
    Foundation of Jiangsu Provincial Double-Innovation Doctor Program (JSSCBS20210507).

Abstract: For the image-text sentiment classification task, a cross-modal feature fusion strategy that combines early fusion with a Transformer model is commonly used to fuse image and text features. However, this strategy tends to focus on the unique information within a single modality while ignoring the interconnections and common information among modalities, resulting in unsatisfactory cross-modal feature fusion. To solve this problem, an image-text sentiment classification method based on multi-scale cross-modal feature fusion is proposed. On the one hand, at the local scale, local feature fusion is performed with a cross-modal attention mechanism, so that the model not only attends to the unique information of image and text but also explores the connections and common information between them. On the other hand, at the global scale, global feature fusion based on an MLM (masked language modeling) loss enables the model to model image and text data jointly, further mining the relationship between the two modalities and thus promoting the deep fusion of image and text features. Compared with ten baseline models on two public datasets, MVSA-Single and MVSA-Multiple, the proposed method shows distinct advantages in accuracy, F1 score, and parameter count, verifying its effectiveness.
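The abstract does not include code, but the local-scale mechanism it describes is a standard cross-modal attention pattern. The PyTorch sketch below is a minimal, hypothetical illustration (the module name, dimensions, and the use of nn.MultiheadAttention are assumptions, not the authors' implementation): queries come from one modality and keys/values from the other, so each modality can mine the information it shares with its counterpart.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative cross-modal attention block (not the authors' code).

    Queries come from one modality, keys/values from the other, so each
    modality attends to information shared with its counterpart.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor,
                context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (batch, seq_q, dim),  e.g. text token features
        # context_feats: (batch, seq_kv, dim), e.g. image region features
        fused, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + fused)  # residual connection

# Toy usage: fuse 32 text tokens with 49 image patches per sample.
text = torch.randn(4, 32, 768)
image = torch.randn(4, 49, 768)
block = CrossModalAttention()
fused_text = block(text, image)   # text enriched with image context
fused_image = block(image, text)  # image enriched with text context
```

Applying the block in both directions (text-to-image and image-to-text) yields the kind of bidirectional local fusion the abstract describes.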
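For the global scale, the abstract describes fusion driven by an MLM loss. As a rough sketch under the same caveats (the encoder interface, helper names, and masking rate are hypothetical), randomly masking text tokens and predicting them from the fused image-text encoding forces the model to exploit image context, which is what ties this objective to cross-modal fusion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlm_step(encoder: nn.Module,
             token_ids: torch.Tensor,
             image_feats: torch.Tensor,
             mask_token_id: int,
             vocab_head: nn.Linear,
             mask_prob: float = 0.15) -> torch.Tensor:
    """One MLM training step over a joint image-text input (illustrative).

    `encoder` is a hypothetical multimodal encoder assumed to return
    per-token hidden states of shape (batch, seq, dim) for the text
    positions, conditioned on the image features.
    """
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    labels[~mask] = -100                       # loss only on masked positions
    masked_ids = token_ids.masked_fill(mask, mask_token_id)
    hidden = encoder(masked_ids, image_feats)  # fused multimodal encoding
    logits = vocab_head(hidden)                # project to vocabulary
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
```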

Key words: Image-Text sentiment classification, Cross-modal feature fusion, Transformer model, Attention mechanism, MLM loss

CLC Number: TP391.1