Computer Science ›› 2025, Vol. 52 ›› Issue (11A): 241100071-8. DOI: 10.11896/jsjkx.241100071

• Artificial Intelligence •

  • Corresponding author: WU Huaiguang (lidaiyiyy@163.com)

Multimodal Entity-Relation Joint Extraction Method Based on Quantum Transformer

LI Daiyi, KONG Delong, WU Huaiguang, ZHANG Jiahui, HAN Yucan   

  1. College of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450000, China
  • Online: 2025-11-15 Published: 2025-11-10
  • Supported by:
    National Natural Science Foundation of China (61672470), Major Science and Technology Research Project in Henan Province (Research on Key Technologies for the Design and Fabrication of Superconducting Quantum Chips) (221100210400), Major Public Welfare Project in Henan Province, China (201300210200) and Doctoral Research Fund of Zhengzhou University of Light Industry (2024BSJJ014).


Abstract: Multimodal Named Entity Recognition (MNER) and Multimodal Relation Extraction (MRE) are two key technologies in the construction of multimodal knowledge graphs. However, existing MNER and MRE methods still have certain limitations in extracting and fusing features from high-dimensional data. To address these issues, this paper proposes a multimodal entity-relation joint extraction method based on a quantum Transformer. First, a parameterized quantum circuit for text data processing is designed; it exploits the superposition and entanglement properties of quantum mechanics and is combined with the Transformer model to extract deep text features. Second, a pyramid visual feature extraction model is designed to obtain hierarchical features from high to low levels, fully accounting for the multi-scale information of the image. Finally, a hierarchical visual prefix network aligns and fuses the hierarchical multi-scale image features with the text features to obtain a highly robust text representation. This study provides a new research direction for multimodal entity-relation joint extraction. Experimental results on three public benchmark datasets show that the proposed quantum Transformer-based multimodal entity-relation extraction method is effective and stable.
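The abstract's first component, a parameterized quantum circuit whose rotation angles encode classical text features, can be illustrated with a minimal state-vector simulation. This is a hypothetical sketch, not the authors' implementation: it assumes a 2-qubit circuit with RY angle encoding, a CNOT for entanglement, one trainable RY layer, and Pauli-Z expectation readout as the feature map.

```python
import numpy as np

def ry(theta):
    """Single-qubit RY rotation matrix."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

# CNOT with qubit 0 as control, qubit 1 as target.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

def quantum_feature_map(x, weights):
    """Angle-encode a 2-dim feature vector x, entangle, apply a trainable
    RY layer, and return the <Z> expectation of each qubit."""
    state = np.zeros(4)
    state[0] = 1.0                                            # start in |00>
    state = np.kron(ry(x[0]), ry(x[1])) @ state               # angle encoding
    state = CNOT @ state                                      # entanglement
    state = np.kron(ry(weights[0]), ry(weights[1])) @ state   # trainable layer
    probs = np.abs(state) ** 2
    # <Z> on qubit 0 is +1 for |00>,|01> and -1 for |10>,|11>; qubit 1 analogous.
    z0 = probs[0] + probs[1] - probs[2] - probs[3]
    z1 = probs[0] - probs[1] + probs[2] - probs[3]
    return np.array([z0, z1])

features = quantum_feature_map(x=np.array([0.3, 1.2]),
                               weights=np.array([0.5, -0.7]))
print(features)  # two expectation values, each in [-1, 1]
```

In the method described by the abstract, such expectation values would feed into the Transformer text encoder; circuit depth, qubit count, and gate choices here are illustrative assumptions only.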

Key words: Multimodal named entity recognition, Multimodal relation extraction, Pyramid visual features, Transformer, Feature fusion

CLC number: TP391