Computer Science ›› 2021, Vol. 48 ›› Issue (4): 237-242. doi: 10.11896/jsjkx.200100036

• Artificial Intelligence •

Multi-granularity Medical Entity Recognition for Chinese Electronic Medical Records

ZHOU Xiao-jin1, XU Chen-ming2, RUAN Tong1

  1 School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  2 School of Science, East China University of Science and Technology, Shanghai 200237, China
  • Received: 2020-06-24 Online: 2021-04-15 Published: 2021-04-09
  • Corresponding author: RUAN Tong (ruantong@ecust.edu.cn)
  • About author: ZHOU Xiao-jin, born in 1996, postgraduate, is a student member of China Computer Federation. His main research interests include natural language processing and information extraction. (zhouxiaojin@mail.ecust.edu.cn)
    RUAN Tong, born in 1973, professor, Ph.D. supervisor, is a member of China Computer Federation. Her main research interests include text extraction, knowledge graph and data quality assessment.
  • Supported by:
    Major Special Project of Precision Medical Research (2018YFC0910500) and National Natural Science Foundation of China (61772201).

Abstract: In existing named entity recognition tasks for Chinese clinical electronic medical records, entities are usually annotated at a granularity that is either too fine or too coarse. Overly fine annotations are hard to put to practical use, while overly coarse annotations usually require complex post-processing before the normalized form and semantic type of each entity can be determined for downstream data mining applications. To simplify this processing, 9 types of fine-grained analytical entities are defined to explain coarse-grained entities, based on the characteristics of 7 common types of coarse-grained clinical entities. In addition, according to the characteristics of multi-granularity entities, a multi-granularity clinical entity recognition model based on multi-task learning and a self-attention mechanism is proposed, and 5,000 texts containing multi-granularity entities are annotated from a real hospital electronic medical record database to verify the model. Experimental results show that the model outperforms mainstream sequence labeling models, reaching F1 scores of 92.88 and 85.48 on the coarse-grained and fine-grained entity recognition tasks, respectively.
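The abstract reports F1 scores but does not spell out the evaluation protocol. The following is a minimal sketch of how an entity-level F1 is commonly computed for sequence labeling, assuming BIO tags and exact span matching; the tag names (`SYM`, `DIS`) and the helper names `bio_to_spans` and `entity_f1` are hypothetical and are not taken from the paper. Under this assumption, the coarse-grained and fine-grained tag sequences would each be scored separately with the same function.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect (start, end, type) spans from one BIO tag sequence."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exactly matching entity spans."""
    gold_spans, pred_spans = set(), set()
    for idx, (g, p) in enumerate(zip(gold, pred)):
        # Prefix each span with the sentence index so spans from
        # different sentences never collide.
        gold_spans |= {(idx, *s) for s in bio_to_spans(g)}
        pred_spans |= {(idx, *s) for s in bio_to_spans(p)}
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical coarse-grained labels for a four-character text.
    gold = [["B-SYM", "I-SYM", "O", "B-DIS"]]
    pred = [["B-SYM", "I-SYM", "O", "O"]]
    print(f"entity-level F1: {entity_f1(gold, pred):.4f}")
```

Exact span matching is a strict criterion: a prediction with the right type but a boundary off by one character counts as both a false positive and a false negative.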

Key words: Electronic medical records, Multi-granularity named entity recognition, Multi-task learning

CLC Number: 

  • TP391