Study on Chinese Named Entity Extraction Rules Based on Boundary Location and Correction

Computer Science, 2023, Vol. 50, Issue (3): 276-281. doi: 10.11896/jsjkx.220200020

LIU Pan¹, GUO Yanming¹, LEI Jun¹, LAO Mingrui², LI Guohui¹

1 College of Systems Engineering, National University of Defense Technology, Changsha 410000, China
2 LIACS Media Lab, Leiden University, Leiden 2333CA, The Netherlands

Received: 2022-02-01 Revised: 2022-05-13 Online: 2023-03-15 Published: 2023-03-15
Corresponding author: GUO Yanming (guoyanming@nudt.edu.cn)
About author: LIU Pan (liupan09@nudt.edu.cn), born in 1990, postgraduate. His main research interests include natural language processing, computer vision and deep learning.
GUO Yanming, born in 1989, Ph.D, associate professor. His main research interests include computer vision, natural language processing and deep learning.
Supported by:
National Natural Science Foundation of China (61806218, 71673293) and Natural Science Foundation of Hunan Province, China (2019JJ50722).

Abstract

English text is naturally composed of words, whereas Chinese text has no word delimiters and Chinese characters combine into words more flexibly, so entity boundaries are harder to determine in Chinese named entity recognition (NER). Current mainstream methods cast NER as a sequence labeling task. This paper adopts the BIOES tag scheme and studies the predicted label sequence. Entity boundary accuracy is computed by comparing the entity head label B or the tail label E separately, and the results show that raising boundary accuracy can further improve entity recognition accuracy. The boundaries of entities with continuous labels are expanded and relocated, the type label of the last character of an entity is used to correct the entity type, and word segmentation information is used to fill in entities with incomplete labels. Finally, a BIO⁺ES labeling scheme with an added boundary label is proposed to distinguish the non-entity characters at entity boundaries, further improving the performance of Chinese NER.

Key words: Chinese named entity recognition, Tag scheme, Entity extraction

CLC number:

TP391
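
The ideas summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `split_tag`, `extract_entities` and `boundary_accuracy`, the `type_from_last` flag, and the example tag sequences are all hypothetical, and the boundary accuracy shown here (the share of gold B or E tags that the prediction reproduces exactly) is only one plausible reading of "separately comparing the head label B or tail label E". The sketch covers strict BIOES decoding and the last-character type correction only; boundary expansion, word-segmentation filling and the BIO⁺ES scheme are not modeled.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start index, end index inclusive, entity type)


def split_tag(tag: str) -> Tuple[str, str]:
    """Split 'B-PER' into ('B', 'PER'); the outside tag 'O' becomes ('O', '')."""
    if tag == "O":
        return "O", ""
    prefix, _, etype = tag.partition("-")
    return prefix, etype


def extract_entities(tags: List[str], type_from_last: bool = False) -> List[Span]:
    """Decode a BIOES tag sequence into entity spans.

    Strict mode (default): an entity is S-X alone, or B-X (I-X)* E-X with one
    consistent type X. With type_from_last=True (an assumed, simplified form of
    the type-correction rule), type mismatches inside a B...E run are tolerated
    and the entity takes the type of its last character.
    """
    spans: List[Span] = []
    start = None      # index of the currently open B tag, if any
    start_type = ""   # type carried by that B tag
    for i, tag in enumerate(tags):
        prefix, etype = split_tag(tag)
        if prefix == "S":
            spans.append((i, i, etype))
            start = None
        elif prefix == "B":
            start, start_type = i, etype
        elif prefix == "I":
            # In strict mode a type mismatch invalidates the open entity.
            if start is not None and not type_from_last and etype != start_type:
                start = None
        elif prefix == "E":
            if start is not None and (type_from_last or etype == start_type):
                spans.append((start, i, etype))
            start = None
        else:  # 'O' closes any open entity without emitting it
            start = None
    return spans


def boundary_accuracy(gold: List[str], pred: List[str], prefix: str) -> float:
    """Share of gold tags with the given prefix ('B' or 'E') that pred matches exactly."""
    positions = [i for i, t in enumerate(gold) if split_tag(t)[0] == prefix]
    if not positions:
        return 0.0
    hits = sum(1 for i in positions if pred[i] == gold[i])
    return hits / len(positions)


# Hypothetical example: a six-character ORG, one outside character, a two-character LOC.
gold = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "E-ORG", "O", "B-LOC", "E-LOC"]
pred = ["B-ORG", "I-ORG", "I-LOC", "I-ORG", "I-ORG", "E-ORG", "O", "B-ORG", "E-LOC"]

print(extract_entities(pred))                       # strict decoding finds no entity
print(extract_entities(pred, type_from_last=True))  # both gold spans are recovered
print(boundary_accuracy(gold, pred, "B"))           # 0.5
print(boundary_accuracy(gold, pred, "E"))           # 1.0
```

In this toy case the predicted sequence has correct boundaries but inconsistent type labels, so strict decoding discards both entities, while taking the type from the last character recovers them. Real predictions will not always behave this way, which is why the rule is exposed here as an optional flag.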

LIU Pan, GUO Yanming, LEI Jun, LAO Mingrui, LI Guohui. Study on Chinese Named Entity Extraction Rules Based on Boundary Location and Correction[J]. Computer Science, 2023, 50(3): 276-281. https://doi.org/10.11896/jsjkx.220200020

