Computer Science ›› 2022, Vol. 49 ›› Issue (6A): 32-38. doi: 10.11896/jsjkx.210400198

• Smart Healthcare •

New Text Retrieval Model of Chinese Electronic Medical Records

YU Jia-qi1,2, KANG Xiao-dong1, BAI Cheng-cheng1, LIU Han-qing1

  1 School of Medical Image, Tianjin Medical University, Tianjin 300202, China
  2 Clinical Medical College of Tianjin Medical University, Tianjin 300270, China
  • Online: 2022-06-10 Published: 2022-06-08
  • Corresponding author: KANG Xiao-dong (433906384@qq.com)
  • About author: YU Jia-qi (yujiaqi0919@163.com), born in 1989, postgraduate. Her main research interests include medical information processing.
    KANG Xiao-dong, born in 1964, Ph.D., professor, postgraduate supervisor, is a member of China Computer Federation. His main research interests include medical image processing and medical information system integration.

Abstract: The growth of electronic medical records (EMRs) forms the basis of user health big data, which can improve the quality of medical services and reduce medical costs, so retrieving cases quickly and effectively has practical significance in clinical medicine. EMR text is highly specialized and has distinctive textual characteristics, yet traditional text retrieval methods express the semantics of text entities inaccurately and achieve low retrieval accuracy. To address these characteristics and problems, this paper proposes a fused BERT-BiLSTM model structure that fully expresses the semantic information of EMR text and improves retrieval accuracy. The study uses public data. First, the public standard Chinese EMR data are preprocessed by expanding the retrieval keywords with associated terms according to clinical diagnosis rules. Second, the BERT model dynamically obtains a character-level vector matrix from the context of the medical record text, and the generated character vectors are fed to a bidirectional long short-term memory network (BiLSTM) to extract global semantic features of the context. Finally, the feature vector of the query document is mapped into Euclidean space, and the medical record texts closest to the query document are found, realizing text retrieval over unstructured clinical data. Simulation results show that the method can mine multi-level, multi-angle semantic features from medical record text and achieves an F1 value of 0.94 on the electronic medical record dataset, significantly improving the accuracy of semantic text retrieval.

Key words: BERT model, Bidirectional long short-term memory network (BiLSTM), Electronic medical record, Extended retrieval keywords, Text retrieval
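
The final retrieval step described in the abstract (ranking medical record texts by their distance to the query document in Euclidean space) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `euclidean_distance` and `retrieve_nearest`, the record identifiers, and the 3-dimensional toy vectors are all hypothetical stand-ins for the feature vectors that a trained BERT-BiLSTM encoder would produce.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_nearest(
    query_vec: Sequence[float],
    record_vecs: Dict[str, Sequence[float]],
    top_k: int = 1,
) -> List[Tuple[str, float]]:
    """Rank records by distance to the query and keep the top_k closest."""
    scored = [
        (record_id, euclidean_distance(query_vec, vec))
        for record_id, vec in record_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:top_k]


# Hypothetical toy feature vectors (assumption): real encoder outputs would be
# much higher-dimensional and come from the trained model.
records = {
    "record_a": [0.9, 0.1, 0.0],
    "record_b": [0.1, 0.8, 0.3],
    "record_c": [0.0, 0.2, 0.9],
}
query = [0.2, 0.7, 0.2]

top = retrieve_nearest(query, records, top_k=2)
for record_id, distance in top:
    print(record_id, round(distance, 3))
```

A brute-force scan like this is linear in the number of records; for a large collection, an approximate nearest-neighbor index would be the usual substitute, though the abstract does not say which search strategy the paper uses.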

CLC Number: TP391