Computer Science, 2025, Vol. 52, Issue (11A): 250600198-10. doi: 10.11896/jsjkx.250600198

• Artificial Intelligence •

Review of Application of Information Extraction Technology in Digital Humanities

WEI Hao1,2,3, ZHANG Zongyu1, DIAO Hongyue1,2,3, DENG Yaochen2,3

  1. School of Software, Dalian University of Foreign Languages, Dalian, Liaoning 116044, China
  2. China Research Center for Northeast Asian Languages, Dalian University of Foreign Languages, Dalian, Liaoning 116044, China
  3. Liaoning New Lab for Innovations in Digital Humanities, Dalian University of Foreign Languages, Dalian, Liaoning 116044, China
  • Online: 2025-11-15 Published: 2025-11-10
  • Corresponding author: DENG Yaochen (deng_yaochen@163.com)
  • About the author: weihao1005@163.com
  • Supported by:
    General Projects of the 14th Five-Year Plan of the Language Commission, China (YB145-82), Liaoning Provincial Department of Education project, China (LJKQZ20222451, JYTQN2023149) and Natural Science Foundation of Liaoning Province, China (2024-BS-203).

Abstract: Digital humanities, as an emerging interdisciplinary field integrating computer science and the humanities, aims to address research challenges in the humanities through digital technologies, thereby advancing disciplinary development, cultural heritage preservation, and cultural dissemination. Information extraction, a core task in natural language processing, enables the automatic extraction of structured knowledge from unstructured texts, providing valuable data support for digital humanities research. This review systematically examines the applications of information extraction technologies in digital humanities, focusing on three key subtasks: named entity recognition, relation extraction, and event extraction. First, it traces the development and representative work of each task, from early rule-based and dictionary methods to traditional machine learning approaches, and further to current mainstream techniques based on deep learning and pre-trained language models, analyzing the trajectory of technological advancement. Second, it discusses the main challenges of information extraction in digital humanities, including data scarcity, complex text structures, ambiguous entity boundaries, and implicit relationship expressions, and critically evaluates the applicability and limitations of existing methods. Finally, it outlines future research directions, such as multimodal information extraction, cross-lingual processing, optimization for low-resource scenarios, knowledge graph construction, and language generation technologies. The review offers theoretical insights and practical guidance for further research and applications of information extraction in digital humanities.

Key words: Digital humanities, Natural language processing, Information extraction, Named entity recognition, Relation extraction, Event extraction, Deep learning

CLC Number: TP391