Computer Science ›› 2024, Vol. 51 ›› Issue (6A): 230500164-9. doi: 10.11896/jsjkx.230500164

• Artificial Intelligence •

Named Entity Recognition Approach of Judicial Documents Based on Transformer

WANG Yingjie1, ZHANG Chengye1, BAI Fengbo2, WANG Zumin1

  1 College of Information Engineering, Dalian University, Dalian 116622, China
  2 School of Artificial Intelligence, Guangxi Minzu University, Nanning 530006, China
  • Published: 2024-06-06
  • Corresponding author: BAI Fengbo (baif@gxun.edu.cn)
  • About the authors (wb@hongstech.com):
    WANG Yingjie, born in 1977, Ph.D., associate professor, is a member of CCF (No. 39234M). Her main research interests include software engineering and trustworthy software.
    BAI Fengbo, born in 1978, Ph.D., senior software engineer, is a member of CCF (No. F6846M). His main research interests include natural language processing, data science, and evidence science.

Abstract: Named entity recognition (NER) is a key task in natural language processing and the foundation for many downstream tasks. Research on NER in the judicial domain remains relatively scarce, and many problems in the informatization and intelligent transformation of the judicial system still need to be solved. Compared with texts in other domains, judicial documents are highly specialized and have few corpus resources, so existing methods recognize entities in them poorly. This work therefore addresses three aspects. First, a multi-label hierarchical iterative annotation method (ML-HIA) is proposed, which automatically annotates raw judicial documents and improves entity recognition performance on them. Second, a feature-mixed Transformer (FM-Transformer) neural network model is proposed, which makes full use of deep features of the inherent attributes of Chinese characters to recognize named entities in judicial documents. Finally, the proposed annotation method and model are compared experimentally with other neural network models. The results show that the proposed annotation method annotates judicial documents fairly accurately, and that the proposed model improves considerably over the baseline models on general-domain datasets and achieves good results on judicial-domain datasets.

Key words: Natural language processing, Data annotation, Transformer model, Deep learning, Judicial informatization

CLC Number: TP391