计算机科学 (Computer Science) ›› 2024, Vol. 51 ›› Issue (8): 272-280. doi: 10.11896/jsjkx.230500047

• Artificial Intelligence •

Word-Character Model with Low Lexical Information Loss for Chinese NER

GUO Zhiqiang, GUAN Donghai, YUAN Weiwei

  1. School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
  • Received: 2023-05-08 Revised: 2023-08-30 Online: 2024-08-15 Published: 2024-08-13
  • Corresponding author: GUAN Donghai (dhguan@nuaa.edu.cn)
  • About author: GUO Zhiqiang, born in 1998, postgraduate (529942688@qq.com). His main research interest is knowledge graphs.
    GUAN Donghai, born in 1981, Ph.D, associate professor, graduate supervisor. His main research interests include data mining, knowledge inference, etc.
  • Supported by:
    Aeronautical Science Foundation of China (ASFC-20200055052005).


Abstract: Chinese named entity recognition (CNER) is a natural language processing task that aims to recognize entities of specific categories in text, such as names of people, places, and organizations. It is a fundamental task underlying natural language applications such as question answering, machine translation, and information extraction. Since Chinese lacks the natural word boundaries of English, word-based NER models suffer significant performance degradation from word segmentation errors, while character-based NER models ignore the role of lexical information. In recent years, many studies have therefore attempted to incorporate lexical information into character-based models; WC-LSTM achieved significant performance gains by injecting lexical information into the start and end characters of each word. However, WC-LSTM still does not fully exploit lexical information, so this paper builds on it and proposes LLL-WCM (word-character model with low lexical information loss), which injects lexical information into all intermediate characters of a word as well, avoiding lexical information loss. Two encoding strategies, averaging (avg) and self-attention, are introduced to extract all lexical information. Experiments on four Chinese datasets show that, compared with WC-LSTM, the F1 score of the proposed method improves by 1.89%, 0.29%, 1.10% and 1.54%, respectively.

Key words: Named entity recognition, Natural language processing, Lexical information loss, Intermediate characters, Encoding strategy
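As a rough illustration of the two encoding strategies named above, the sketch below pools the embeddings of the lexicon words matched at one character position by averaging and by a small self-attention scorer, then concatenates the pooled vector with the character embedding. All names, dimensions, and the randomly initialized parameters are hypothetical stand-ins for the paper's learned components, not its actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical setup: k lexicon words match at one character position,
# each represented by a d-dimensional word embedding.
k, d = 3, 8
word_embs = rng.normal(size=(k, d))

# Strategy 1: average pooling over all matched word embeddings.
avg_enc = word_embs.mean(axis=0)

# Strategy 2: self-attention pooling. W and u stand in for learned
# parameters; each matched word gets a normalized attention weight.
W = rng.normal(size=(d, d))
u = rng.normal(size=d)
scores = softmax(np.tanh(word_embs @ W) @ u)  # shape (k,), sums to 1
attn_enc = scores @ word_embs                 # weighted sum, shape (d,)

# The pooled lexical vector is then concatenated with the character
# embedding before the sequence-labeling layers.
char_emb = rng.normal(size=d)
fused = np.concatenate([char_emb, avg_enc])   # shape (2 * d,)
```

Either pooled vector has a fixed size regardless of how many words match, which is what lets every character, including intermediate ones, carry lexical information through the model.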

CLC Number: TP391