Computer Science ›› 2025, Vol. 52 ›› Issue (11A): 241100071-8.doi: 10.11896/jsjkx.241100071

• Artificial Intelligence •

Multimodal Entity-Relation Joint Extraction Method Based on Quantum Transformer

LI Daiyi, KONG Delong, WU Huaiguang, ZHANG Jiahui, HAN Yucan   

  1. College of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450000, China
  • Online:2025-11-15 Published:2025-11-10
  • About author: LI Daiyi, born in 1988, Ph.D., is a member of CCF (No. 58208G). His main research interests include natural language processing, knowledge graphs, and big data.
    WU Huaiguang, born in 1976, Ph.D., professor, is a member of CCF (No. 13128D). His main research interests include big data, ubiquitous computing, and formal methods.
  • Supported by:
    National Natural Science Foundation of China (61672470), Major Science and Technology Research Projects in Henan Province (221100210400), Major Public Welfare Projects in Henan Province, China (201300210200), and Doctoral Research Fund of Zhengzhou University of Light Industry (2024BSJJ014).

Abstract: Multimodal Named Entity Recognition (MNER) and Multimodal Relation Extraction (MRE) are two key technologies in the construction of multimodal knowledge graphs. However, existing MNER and MRE methods remain limited in extracting and fusing features from high-dimensional data. To address these issues, this paper proposes a multimodal entity-relation joint extraction method based on a quantum Transformer. First, a parameterized quantum circuit for text data processing is designed, which exploits the superposition and entanglement properties of quantum mechanics and is combined with the Transformer model to extract deep features from text. Second, a pyramid visual feature extraction model is designed to obtain hierarchical features from high to low, fully accounting for the multi-scale information in the image. Finally, a hierarchical visual prefix network aligns and fuses the hierarchical multi-scale image features with the text features to obtain a highly robust text representation. This study provides a new research approach for multimodal entity-relation joint extraction. Experimental results on three public benchmark datasets show that the proposed quantum Transformer-based method is effective and stable.
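
To make the idea of a parameterized quantum circuit concrete, the following is a minimal sketch simulated classically with NumPy. It is not the circuit proposed in the paper, whose layout the abstract does not specify. The two-qubit size, the RY angle encoding, the single CNOT entangler, and the names `ry`, `pqc_features`, `x`, and `weights` are all assumptions chosen for illustration.

```python
import numpy as np


def ry(theta):
    """Single-qubit RY rotation matrix (real-valued)."""
    c, s = np.cos(theta / 2.0), np.sin(theta / 2.0)
    return np.array([[c, -s], [s, c]])


# CNOT with qubit 0 as control and qubit 1 as target, basis order |q0 q1>.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

Z = np.diag([1.0, -1.0])  # Pauli-Z observable
I2 = np.eye(2)


def pqc_features(x, weights):
    """Hypothetical two-qubit circuit: encode, entangle, rotate, measure.

    x       : two classical input features, used as RY angles (assumed encoding)
    weights : two trainable RY angles (assumed parameterization)
    returns : expectation value of Pauli-Z on each qubit
    """
    state = np.zeros(4)
    state[0] = 1.0                                        # start in |00>
    state = np.kron(ry(x[0]), ry(x[1])) @ state           # angle encoding (superposition)
    state = CNOT @ state                                  # entangle the two qubits
    state = np.kron(ry(weights[0]), ry(weights[1])) @ state  # trainable layer
    z0 = state @ np.kron(Z, I2) @ state                   # <Z> on qubit 0
    z1 = state @ np.kron(I2, Z) @ state                   # <Z> on qubit 1
    return np.array([z0, z1])
```

In a hybrid model of the kind the abstract describes, outputs like these expectation values would be passed to classical layers (here, presumably the Transformer), and `weights` would be trained together with the rest of the network. How the paper actually wires the two together is not stated in the abstract.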

Key words: MNER, MRE, Pyramid visual feature, Transformer, Feature fusion

CLC Number: 

  • TP391