计算机科学 (Computer Science), 2021, Vol. 48, Issue (3): 233-238. doi: 10.11896/jsjkx.191200074
张栋, 陈文亮
ZHANG Dong, CHEN Wen-liang
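Abstract: Named entity recognition (NER) aims to identify proper nouns in text and classify them. Training data for supervised learning is usually annotated by hand, which is time-consuming and labor-intensive, so large-scale annotated data is hard to obtain. To address the data scarcity caused by the lack of large-scale annotated corpora in Chinese NER, as well as the character polysemy that traditional character embeddings cannot handle, this paper uses context-dependent character embeddings pre-trained on large-scale unlabeled data: a language model generates context-dependent character embeddings to improve the performance of the Chinese NER model. In addition, to address the out-of-vocabulary (OOV) word problem in NER, the paper proposes a Chinese NER system based on a character-level language model. The character embeddings learned by the language model serve as the input to the NER model, so that the same Chinese character has different representations in different contexts. Experiments are conducted on six Chinese NER datasets. The results show that context-dependent character embeddings clearly improve the NER model, raising the average F1 score by 4.95%. Further analysis shows that the new system also performs well on OOV entity recognition and on some special types of Chinese entities.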
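The central idea of the abstract, that the same character receives different vectors in different contexts, can be illustrated with a toy sketch. This is not the authors' model: the hash-based `static_vector`, the window-averaging `contextual_vectors`, and the size `DIM` are hypothetical stand-ins chosen only to make the contrast between a static lookup and a context-dependent representation visible. In the paper, a pre-trained character language model plays the role that the simple averaging plays here.

```python
import hashlib
from typing import List

DIM = 8  # hypothetical embedding size, chosen only for this toy example


def static_vector(ch: str) -> List[float]:
    """Context-independent vector: a deterministic pseudo-embedding from a hash."""
    digest = hashlib.sha256(ch.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]


def contextual_vectors(sentence: str, window: int = 1) -> List[List[float]]:
    """Toy stand-in for a language model: average static vectors over a window.

    Each character's output vector mixes in its neighbours, so it depends on context.
    """
    static = [static_vector(ch) for ch in sentence]
    out = []
    for i in range(len(sentence)):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        neighbours = static[lo:hi]
        out.append([sum(col) / len(neighbours) for col in zip(*neighbours)])
    return out


# The character "行" appears in "银行" (bank) and in "行走" (to walk).
bank = contextual_vectors("银行")
walk = contextual_vectors("行走")

same_static = static_vector("行") == static_vector("行")  # static lookup ignores context
differs_in_context = bank[1] != walk[0]                    # contextual vectors do not
print(same_static, differs_in_context)
```

A static lookup table returns one vector per character regardless of the sentence, whereas the context-dependent vectors for "行" differ between the two words, which is the property the NER model is meant to exploit.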