Chinese Medical Named Entity Recognition Based on Multi-feature Embedding (基于多特征嵌入的中文医学命名实体识别)

doi:10.11896/jsjkx.220400115

Computer Science (计算机科学), 2023, Vol. 50, Issue (6): 243-250. doi: 10.11896/jsjkx.220400115

HUANG Jiange (黄健格)¹, JIA Zhen (贾真)¹,², ZHANG Fan (张凡)¹,², LI Tianrui (李天瑞)¹,²,³

1 School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, China
2 Manufacturing Industry Chains Collaboration and Information Support Technology Key Laboratory of Sichuan Province, Chengdu 611756, China
3 National Engineering Laboratory of Integrated Transportation Big Data Application Technology, Chengdu 611756, China

Received: 2022-04-11 Revised: 2022-09-15 Online: 2023-06-15 Published: 2023-06-06
Corresponding author: LI Tianrui (trli@swjtu.edu.cn)
About author: (hjgeuraka@163.com)
HUANG Jiange, born in 1996, postgraduate, is a member of China Computer Federation. His main research interests include named entity recognition and natural language processing.
LI Tianrui, born in 1969, Ph.D., professor, Ph.D. supervisor, is a distinguished member of China Computer Federation. His main research interests include big data intelligence, rough sets and granular computing.
Supported by: National Natural Science Foundation of China (62176221).

Abstract

Abstract: Chinese medical named entity recognition (NER) models based on character representations carry only a single kind of embedding information and lack word boundary and text structure information. To address these problems, this paper presents a medical NER model that integrates multi-feature embedding. First, characters are mapped to fixed-length embedding representations. Second, external resources are introduced to construct a lexical feature, which supplements each character with its potential word (phrase) information. Third, in view of the pictographic nature of Chinese characters and the characteristics of text sequences, a character structure feature and a sequence structure feature are introduced, and convolutional neural networks (CNNs) encode the two structural features to obtain radical-level and sentence-level word embeddings. Finally, the resulting feature embeddings are concatenated and fed into a long short-term memory (LSTM) network for encoding, and a conditional random field (CRF) layer outputs the entity predictions. Experiments on a self-built Chinese medical dataset and the medical data provided by the CHIP_2020 task show that, compared with the benchmark models, the proposed model, which integrates both lexical and text structure features, can effectively recognize medical named entities.

Keywords: Named entity recognition, Chinese medical text, Lexical information, Text structure features, Deep learning
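
The lexical-feature step in the abstract can be illustrated with a minimal sketch. The abstract does not say how the lexical feature is built, so everything below is an assumption for illustration only: the B/M/E/S (begin/middle/end/single) matching scheme, the toy lexicon, the count-based `lexical_vector`, the two-dimensional `char_emb`, and all function names are hypothetical and are not the authors' implementation.

```python
from typing import Dict, List, Set


def match_lexicon(chars: str, lexicon: Set[str], max_len: int = 4) -> List[Dict[str, Set[str]]]:
    """For each character, collect the lexicon words in which it appears as the
    Beginning, Middle, or End character, or as a Single-character word."""
    feats = [{"B": set(), "M": set(), "E": set(), "S": set()} for _ in chars]
    n = len(chars)
    for start in range(n):
        # Try every span of up to max_len characters starting at `start`.
        for end in range(start + 1, min(n, start + max_len) + 1):
            word = chars[start:end]
            if word not in lexicon:
                continue
            if len(word) == 1:
                feats[start]["S"].add(word)
            else:
                feats[start]["B"].add(word)
                feats[end - 1]["E"].add(word)
                for mid in range(start + 1, end - 1):
                    feats[mid]["M"].add(word)
    return feats


def lexical_vector(feat: Dict[str, Set[str]]) -> List[float]:
    """Assumed toy encoding: number of matched words per B/M/E/S slot."""
    return [float(len(feat[key])) for key in "BMES"]


# Hypothetical toy lexicon and sentence ("the patient has a history of diabetes").
lexicon = {"患者", "糖尿病", "病史"}
sentence = "患者有糖尿病史"
feats = match_lexicon(sentence, lexicon)

# Concatenate an assumed 2-dimensional character embedding with the lexical vector,
# mirroring the "concatenate the feature embeddings" step at toy scale.
char_emb = [0.1, 0.2]
fused = char_emb + lexical_vector(feats[5])
```

In this toy sentence the character 病 ends the word 糖尿病 and begins the word 病史, so its feature records both candidate word boundaries. A character-only embedding cannot express that ambiguity, which is the kind of information the lexical feature is meant to supply.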

CLC Number: TP391

HUANG Jiange, JIA Zhen, ZHANG Fan, LI Tianrui. Chinese Medical Named Entity Recognition Based on Multi-feature Embedding[J]. Computer Science, 2023, 50(6): 243-250. https://doi.org/10.11896/jsjkx.220400115
