Computer Science, 2023, Vol. 50, Issue (7): 221-228. doi: 10.11896/jsjkx.220700074

• Artificial Intelligence •


New Word Detection Based on Branch Entropy-Segmentation Probability Model

ZHU Yuying 1,2; GUO Yan 1,2; WAN Yizhao 2; TIAN Kai 2

  1 Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu 215123, China
    2 School of Software Engineering, University of Science and Technology of China, Suzhou, Jiangsu 215123, China
  • Received: 2022-07-07  Revised: 2022-10-23  Online: 2023-07-15  Published: 2023-07-05
  • Corresponding author: GUO Yan (guoyan@ustc.edu.cn)
  • About author: ZHU Yuying (hiyazyy@mail.ustc.edu.cn), born in 1997, master. Her main research interests include NLP and machine learning. GUO Yan, born in 1981, lecturer. Her main research interests include information security, NLP and blockchain.


Abstract: As a basic task of Chinese natural language processing, new word detection is crucial for improving the performance of various downstream tasks. This paper proposes a new word detection method based on branch entropy and segmentation probability. The method first generates a candidate word set from the target text based on branch entropy, and then computes a segmentation probability for each candidate so as to filter out noisy candidates and retain genuine new words. Two different models are proposed, depending on whether an annotated corpus related to the target text is available. In the absence of a related segmented corpus, a multi-criteria Transformer-CRF model is trained on general word-segmentation benchmark datasets; when a domain-specific segmented corpus is available, a key-value memory neural network is introduced to fully exploit wordhood information. Experimental results show that the multi-criteria Transformer-CRF model achieves a MAP of 54.00% for law-related words among the top 900 returned words, which is 2.15% higher than that of the candidate set generated by the unsupervised method. When extracting new words from a segmented legal corpus, the key-value memory neural network further outperforms the multi-criteria Transformer-CRF model, achieving an improvement of 3.43%.
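The pipeline described above has two stages. As a rough illustration of the first, unsupervised stage, the sketch below proposes candidate words from raw text by combining internal cohesion (pointwise mutual information over the weakest internal split) with boundary freedom (left/right branch entropy). This is not the authors' code; the n-gram range, frequency floor, and thresholds are illustrative assumptions.

```python
# Minimal sketch of unsupervised candidate-word generation: cohesion is measured
# by pointwise mutual information, boundary freedom by left/right branch entropy.
# All thresholds and the n-gram range are illustrative, not the paper's settings.
import math
from collections import Counter, defaultdict

def extract_candidates(text, max_len=4, min_count=5,
                       pmi_threshold=3.0, entropy_threshold=1.0):
    """Return candidate multi-character words with their (PMI, branch entropy) scores."""
    ngram_counts = Counter()
    left_neighbors = defaultdict(Counter)
    right_neighbors = defaultdict(Counter)
    total_chars = len(text)

    # Count all n-grams up to max_len, plus the characters adjacent to each n-gram.
    for n in range(1, max_len + 1):
        for i in range(total_chars - n + 1):
            gram = text[i:i + n]
            ngram_counts[gram] += 1
            if n > 1:
                if i > 0:
                    left_neighbors[gram][text[i - 1]] += 1
                if i + n < total_chars:
                    right_neighbors[gram][text[i + n]] += 1

    def probability(gram):
        # Rough relative-frequency estimate; only relative values matter here.
        return ngram_counts[gram] / total_chars

    def pmi(gram):
        # Cohesion: the weakest internal split determines the score.
        best = float("inf")
        for k in range(1, len(gram)):
            left, right = gram[:k], gram[k:]
            score = math.log(probability(gram) /
                             (probability(left) * probability(right)))
            best = min(best, score)
        return best

    def branch_entropy(neighbor_counts):
        total = sum(neighbor_counts.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log(c / total)
                    for c in neighbor_counts.values())

    candidates = {}
    for gram, count in ngram_counts.items():
        if len(gram) < 2 or count < min_count:
            continue
        cohesion = pmi(gram)
        freedom = min(branch_entropy(left_neighbors[gram]),
                      branch_entropy(right_neighbors[gram]))
        if cohesion >= pmi_threshold and freedom >= entropy_threshold:
            candidates[gram] = (cohesion, freedom)
    return candidates
```

In practice, candidates already present in an existing lexicon would be discarded, and only the remaining out-of-vocabulary strings would be passed on to the segmentation-probability stage.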

Key words: New word detection, Branch entropy, Mutual information, Transformer, Conditional random fields, Key-value memory neural networks
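For the second stage, the abstract describes scoring each candidate by the probability that a segmentation model treats it as a single word. The sketch below assumes a character-level tagger (the role played in the paper by the multi-criteria Transformer-CRF model or the key-value memory network) that exposes per-character B/M/E/S probabilities for a candidate's occurrence in context; multiplying these marginals as if they were independent is a simplification made purely for illustration.

```python
# Minimal sketch of segmentation-probability filtering for candidate new words.
# span_probs holds one {B, M, E, S} distribution per character of a candidate,
# taken from a character-level segmenter run on the candidate's sentence context.
from typing import Dict, List

BMES = Dict[str, float]  # distribution over the tags "B", "M", "E", "S"

def single_word_probability(span_probs: List[BMES]) -> float:
    """Probability that the span is segmented as exactly one word,
    treating per-character marginals as independent for simplicity."""
    if len(span_probs) == 1:
        return span_probs[0]["S"]
    score = span_probs[0]["B"] * span_probs[-1]["E"]
    for dist in span_probs[1:-1]:
        score *= dist["M"]
    return score

def rank_candidates(candidate_spans: Dict[str, List[BMES]],
                    known_words=frozenset(), threshold=0.5):
    """Rank out-of-lexicon candidates by their segmentation probability."""
    scored = {w: single_word_probability(p)
              for w, p in candidate_spans.items()
              if w not in known_words}
    return sorted(((w, s) for w, s in scored.items() if s >= threshold),
                  key=lambda item: item[1], reverse=True)
```

Ranking candidates by this score is also what makes a MAP@k evaluation, such as the reported MAP over the top 900 words, computable against a gold list of domain terms.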

CLC number: TP391