Computer Science, 2023, Vol. 50, Issue (7): 221-228. doi: 10.11896/jsjkx.220700074
ZHU Yuying1,2, GUO Yan1,2, WAN Yizhao2, TIAN Kai2
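Abstract: New word discovery is a fundamental task in Chinese natural language processing and is crucial for improving the performance of downstream tasks. This paper proposes a new word discovery method based on an information entropy-segmentation probability model. The method first uses information entropy to generate a candidate word set from the text to be processed, then computes segmentation probabilities for the candidates in order to filter out the genuine new words. Two models are proposed, depending on whether an annotated corpus related to the target text is available. When no such corpus is available, a multi-label Transformer-CRF model is trained on general word segmentation benchmark datasets. When a domain-specific annotated corpus is available, a key-value memory neural network is introduced to fully incorporate wordhood information. Experimental results show that the multi-label Transformer-CRF model reaches a MAP of 54.00% for law-related words among the top 900 words, an improvement of 2.15% over the candidate set generated by the unsupervised method. When extracting new words from a legal-domain corpus, the key-value memory neural network further outperforms the multi-label Transformer-CRF model, with an improvement of 3.43%.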
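The first stage described in the abstract, generating candidate words from raw text by information entropy, can be illustrated with a minimal sketch. The abstract does not give the exact entropy formulation, so this example assumes the common left/right branch entropy of character n-grams. The function names (`branch_entropy`, `candidate_words`) and the thresholds (`max_len`, `min_count`, `min_entropy`) are hypothetical and are not taken from the paper. The second stage (segmentation-probability filtering with a Transformer-CRF or key-value memory network) is not shown.

```python
import math
from collections import Counter, defaultdict


def branch_entropy(neighbors):
    """Shannon entropy (natural log) of a neighbor-character frequency table."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())


def candidate_words(text, max_len=4, min_count=2, min_entropy=0.5):
    """Return n-grams whose left and right contexts are both varied.

    Assumption: a string is a plausible word when it is frequent and the
    characters adjacent to it are unpredictable on both sides.
    """
    counts = Counter()
    left = defaultdict(Counter)   # n-gram -> counts of preceding characters
    right = defaultdict(Counter)  # n-gram -> counts of following characters
    n = len(text)
    for i in range(n):
        for k in range(2, max_len + 1):
            if i + k > n:
                break
            gram = text[i:i + k]
            counts[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + k < n:
                right[gram][text[i + k]] += 1

    scored = {}
    for gram, count in counts.items():
        if count < min_count:
            continue
        # Score by the weaker side: a word boundary is needed on both sides.
        score = min(branch_entropy(left[gram]), branch_entropy(right[gram]))
        if score >= min_entropy:
            scored[gram] = score
    return sorted(scored, key=scored.get, reverse=True)
```

On the toy string `"甲合同乙丙合同丁戊合同己"`, the only n-gram that appears more than once is `合同`, and it has three distinct neighbors on each side, so it is the only candidate returned. A real pipeline would run this over a large corpus and pass the ranked candidates to the supervised filtering stage.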