Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 230200083-6. doi: 10.11896/jsjkx.230200083

• Artificial Intelligence •

Audit Text Named Entity Recognition Based on MacBERT and Adversarial Training

QIAN Taiyu, CHEN Yifei, PANG Bowen

  1. School of Computer Science, Nanjing Audit University, Nanjing 211815, China
  • Published: 2023-11-09
  • Corresponding author: CHEN Yifei (yifeichen91@nau.edu.cn)
  • About author: QIAN Taiyu (qiantyy@163.com), born in 1994, postgraduate, is a member of China Computer Federation. His main research interest is text mining.
    CHEN Yifei, born in 1977, Ph.D., associate professor. Her main research interests include text mining and intelligent information extraction.
  • Supported by:
    Postgraduate Research & Practice Innovation Program of Jiangsu Province (SJCX22_0995).

Abstract: To automatically identify useful entity information in audit text and improve the efficiency of policy tracking audits, this paper proposes Audit-MBCA, a named entity recognition (NER) model for audit text based on MacBERT (MLM as correction BERT) and adversarial training. Deep learning is now mature in NER and has achieved significant results, but audit text suffers from a lack of corpora and from unclear entity boundaries. To address these problems, an audit text dataset named Audit2022 is constructed, and its vector representation is obtained with the MacBERT Chinese pre-trained language model. Adversarial training is also introduced so that the word boundary information shared by the Chinese word segmentation (CWS) task and the NER task helps identify entity boundaries. Experimental results show that Audit-MBCA achieves an F1 of 91.05% on the Audit2022 dataset, 4.53% higher than mainstream models, and an F1 of 93.70% on the SIGHAN2006 dataset, 0.33%~3.25% higher than other models, which verifies the effectiveness and generalization ability of the proposed model.

Key words: Audit text, Named entity recognition, MacBERT, Adversarial training
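
The F1 values reported in the abstract can be made concrete with a small sketch. The abstract does not state the tagging scheme or the scoring rule, so the code below assumes BIO tags and exact-match, entity-level scoring, which is a common convention for NER; the function names `extract_spans` and `entity_f1` are hypothetical and are not taken from the paper. The example also shows why entity boundaries matter: a prediction that finds the right type but cuts the entity short counts as an error.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (assumed scheme: B-X, I-X, O)."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        if tag.startswith("I-") and label == tag[2:]:
            continue  # current entity keeps growing
        if label is not None:
            spans.add((start, i, label))  # close the entity that just ended
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # A stray I-X with no matching open entity is ignored in this sketch.
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over entity spans."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical six-character sentence: the prediction truncates the ORG entity
# by one character, so only the LOC entity is counted as correct.
gold = ["B-ORG", "I-ORG", "I-ORG", "O", "B-LOC", "I-LOC"]
pred = ["B-ORG", "I-ORG", "O", "O", "B-LOC", "I-LOC"]
print(entity_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

Under this scoring rule a single boundary error costs both precision and recall, which is consistent with the abstract's motivation for borrowing word boundary information from the CWS task.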

CLC Number: 

  • TP391