Computer Science, 2021, Vol. 48, Issue (3): 233-238. doi: 10.11896/jsjkx.191200074

• Artificial Intelligence •

Chinese Named Entity Recognition Based on Contextualized Char Embeddings

ZHANG Dong, CHEN Wen-liang

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Received: 2019-12-09 Revised: 2020-05-28 Online: 2021-03-15 Published: 2021-03-05
  • Corresponding author: CHEN Wen-liang (wlchen@suda.edu.cn)
  • Author e-mail: dzhang19@stu.suda.edu.cn
  • About author: ZHANG Dong, born in 1992, postgraduate, is a member of China Computer Federation. His main research interests include natural language processing and named entity recognition.
    CHEN Wen-liang, born in 1977, professor, doctoral supervisor, is a member of China Computer Federation. His main research interests include natural language understanding, information extraction and knowledge graph.
  • Supported by:
    National Natural Science Foundation of China (61876115).

Abstract: Named entity recognition (NER) aims to identify proper nouns in text and classify them. Training data for supervised learning are usually annotated by hand, which is time-consuming and labor-intensive, so large-scale annotated data are difficult to obtain. To address the data scarcity caused by the lack of large-scale annotated corpora in Chinese NER, as well as the polysemy of characters that traditional char embeddings cannot handle, this paper uses contextualized char embeddings pre-trained on large-scale unlabeled data; that is, a language model generates context-dependent char embeddings to improve the performance of the Chinese NER model. Furthermore, to address out-of-vocabulary (OOV) words in NER, this paper proposes a Chinese NER system based on a character language model. The char embeddings learned by the language model serve as the input of the NER model, so that the same Chinese character has different representations in different contexts. Experiments are conducted on six Chinese NER datasets. The results show that contextualized char embeddings improve the performance of the NER model, raising the average F1 score by 4.95%. Further analysis shows that the new system also performs well on OOV entities and on some special types of Chinese entities.

Key words: Contextualized char vector, Language model, Named entity recognition
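
The central idea of the abstract, that the same Chinese character should receive different vectors in different contexts, can be illustrated with a toy sketch. This is not the paper's language model: the function names `static_vector` and `contextual_vectors`, the code-point-based lookup, the neighbor-averaging rule, and the two example sentences are all hypothetical stand-ins for a pre-trained character language model.

```python
# Toy illustration only: neighbor averaging stands in for a char language model.
DIM = 4  # hypothetical embedding size

def static_vector(ch):
    """Context-free vector for a character, derived from its code point.

    A real system would use a learned lookup table instead.
    """
    return [((ord(ch) >> (4 * d)) & 0xF) / 15.0 for d in range(DIM)]

def contextual_vectors(sentence, window=1):
    """Return one vector per character, mixed with its neighbors' vectors.

    Each output vector is the mean of the static vectors inside the window,
    so it depends on the surrounding characters.
    """
    static = [static_vector(ch) for ch in sentence]
    mixed = []
    for i in range(len(sentence)):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        span = static[lo:hi]
        mixed.append([sum(v[d] for v in span) / len(span) for d in range(DIM)])
    return mixed

sent_a = "苹果好吃"  # "apples are tasty"
sent_b = "苹果公司"  # "Apple Inc."

# The static vector of 果 is the same in both sentences.
same_static = static_vector(sent_a[1]) == static_vector(sent_b[1])

# The contextual vector of 果 differs because its right neighbor differs.
ctx_a = contextual_vectors(sent_a)[1]
ctx_b = contextual_vectors(sent_b)[1]
differs_in_context = ctx_a != ctx_b
```

In an actual system of the kind the abstract describes, the context-dependent vectors would come from a language model trained on unlabeled text and would then be fed to the NER tagger as input features.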

CLC Number:

  • TP391.1
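
The abstract reports results as F1 and singles out OOV entities. The sketch below shows one common way such numbers can be computed from character-level BIO tags. It rests on assumptions: the abstract does not give the paper's tagging scheme or its exact definition of an OOV entity, so the BIO scheme, the rule "an entity is OOV if its text never appears as an entity in the training data", and all function names and example data are hypothetical.

```python
def extract_entities(chars, tags):
    """Collect (start, end, type, text) spans from BIO tags."""
    entities, start, etype = set(), None, None
    # A trailing "O" sentinel closes an entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = (tag.startswith("I-") and start is not None
                     and tag[2:] == etype)
        if continues:
            continue
        if start is not None:
            entities.add((start, i, etype, "".join(chars[start:i])))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

def precision_recall_f1(gold, pred):
    """Entity-level scores: a prediction counts only on exact span and type match."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def oov_recall(gold, pred, train_entity_texts):
    """Recall restricted to gold entities whose text is unseen in training (assumed OOV rule)."""
    oov_gold = {e for e in gold if e[3] not in train_entity_texts}
    if not oov_gold:
        return 0.0
    return len(oov_gold & pred) / len(oov_gold)

chars = "小明在苏州大学读书"  # "Xiao Ming studies at Soochow University"
gold_tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O"]
pred_tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O", "O", "O"]

gold = extract_entities(chars, gold_tags)
pred = extract_entities(chars, pred_tags)
scores = precision_recall_f1(gold, pred)         # (0.5, 0.5, 0.5)
oov = oov_recall(gold, pred, {"苏州大学"})        # 1.0: the only OOV entity, 小明, is found
```

Here the tagger finds the person name but mislabels the organization as a shorter location, so one of two predictions and one of two gold entities match.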