Computer Science, 2026, Vol. 53, Issue (6A): 250400127-6. doi: 10.11896/jsjkx.250400127

• Artificial Intelligence

Semantic Modeling and Co-attention Mechanism for Multimodal Sarcasm Detection Method

WEI Wei1, LI Bicheng1, ZHU Zhenshui2, ZUO Jun2   

  1 College of Computer Science and Technology, Huaqiao University, Xiamen, Fujian 361021, China
    2 Xiamen Meiya Boke Information Security Research Institute Company Limited, Xiamen, Fujian 361008, China
  • Online: 2026-06-16 Published: 2026-06-12
  • About author: WEI Wei, born in 1999, postgraduate. His main research interests include natural language processing and multimodal knowledge graphs for online public opinion.
    LI Bicheng, born in 1970, professor, doctoral supervisor. His main research interests include intelligent information processing, network ideological security, online public opinion monitoring and guidance, and big data analysis and mining.
  • Supported by:
    Construction of Multimodal Large Models in the Public Safety Domain, Product Development, and Industrial Application (3502Z20241029).

Abstract: Sarcasm is widely used on social media and in other forms of computer-mediated communication. Multimodal sarcasm detection, which draws on both textual and visual information, is challenging because the content is diverse and complex and because sarcasm often hinges on implicit contrast and semantic conflict between modalities. To better capture such cross-modal semantic discrepancies, this paper proposes a method that combines semantic modeling with a co-attention mechanism (Co-Attention Transformer). Building on the representational power of the pre-trained CLIP model, the method uses co-attention to strengthen deep interaction and feature fusion between the modalities. It further performs graph-based modeling over syntactic dependency trees and introduces semantic similarity enhancement to improve the semantic alignment between text and image. Experiments on a public sarcasm detection dataset show that the proposed method outperforms traditional baseline methods.
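The abstract combines three components: co-attention between text and image features, graph modeling over a dependency tree, and a text-image semantic similarity signal. The NumPy sketch below shows one way these pieces could fit together on toy data. It is not the authors' implementation, and every specific in it is an assumption made for illustration: random arrays stand in for CLIP text-token and image-patch features; the feature sizes are hypothetical; the co-attention layer is reduced to single-head cross-attention with residual connections (no LayerNorm or feed-forward sublayers); the graph step uses a standard symmetric-normalized GCN layer; the dependency edges are a made-up parse; and "semantic similarity enhancement" is approximated by appending a cosine-similarity score to the fused features. The helper names (`co_attention`, `dependency_adjacency`, `gcn_layer`, and so on) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats, Wq, Wk, Wv):
    # Scaled dot-product attention: queries come from one modality,
    # keys and values from the other.
    q, k, v = q_feats @ Wq, kv_feats @ Wk, kv_feats @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def co_attention(text, image, params):
    # Co-attention: text tokens attend to image patches and image patches
    # attend to text tokens; each stream keeps its own residual connection.
    t2i = text + cross_attention(text, image, *params["text"])
    i2t = image + cross_attention(image, text, *params["image"])
    return t2i, i2t

def dependency_adjacency(n_tokens, edges):
    # Undirected adjacency matrix from (head, dependent) pairs, with self-loops.
    A = np.eye(n_tokens)
    for h, dep in edges:
        A[h, dep] = A[dep, h] = 1.0
    return A

def gcn_layer(H, A, W):
    # One GCN layer with symmetric normalization: ReLU(D^-1/2 A D^-1/2 H W).
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A @ d_inv_sqrt @ H @ W, 0.0)

def cosine(a, b):
    # Cosine similarity between two vectors (epsilon avoids division by zero).
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

rng = np.random.default_rng(0)
d = 16                                     # hypothetical feature size
n_tok, n_patch = 6, 4                      # 6 word tokens, 4 image patches
text = rng.standard_normal((n_tok, d))     # stand-in for CLIP text token features
image = rng.standard_normal((n_patch, d))  # stand-in for CLIP image patch features

def proj():
    # Random projection matrix standing in for learned weights.
    return rng.standard_normal((d, d)) / np.sqrt(d)

params = {"text": (proj(), proj(), proj()), "image": (proj(), proj(), proj())}
t_fused, i_fused = co_attention(text, image, params)

# Hypothetical dependency parse of the 6-token sentence as (head, dependent) pairs.
edges = [(1, 0), (1, 2), (2, 3), (1, 4), (4, 5)]
A = dependency_adjacency(n_tok, edges)
t_graph = gcn_layer(t_fused, A, proj())

# Mean-pooled text and image vectors plus their cross-modal similarity score,
# concatenated as the input a sarcasm classifier would receive.
t_vec, i_vec = t_graph.mean(axis=0), i_fused.mean(axis=0)
sim = cosine(t_vec, i_vec)
features = np.concatenate([t_vec, i_vec, [sim]])
print(features.shape)  # (33,)
```

In a trained model the projection matrices would be learned and the similarity signal might instead enter as a training objective; exposing it as an explicit feature simply makes the cross-modal (in)congruity visible to the classifier in this sketch.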

Key words: Multimodal sarcasm detection, CLIP model, Semantic similarity, Co-attention mechanism, Dependency tree, Graph structure

CLC Number: TP391