Computer Science ›› 2023, Vol. 50 ›› Issue (7): 221-228.doi: 10.11896/jsjkx.220700074

• Artificial Intelligence •

New Word Detection Based on Branch Entropy-Segmentation Probability Model

ZHU Yuying1,2, GUO Yan1,2, WAN Yizhao2, TIAN Kai2   

  1. Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu 215123, China
  2. School of Software Engineering, University of Science and Technology of China, Suzhou, Jiangsu 215123, China
  • Received: 2022-07-07 Revised: 2022-10-23 Online: 2023-07-15 Published: 2023-07-05
  • About author: ZHU Yuying, born in 1997, master. Her main research interests include NLP and machine learning. GUO Yan, born in 1981, lecturer. Her main research interests include information security, NLP and blockchain.

Abstract: As a basic task of Chinese natural language processing, new word detection is crucial for improving the performance of various downstream tasks. This paper proposes a new word detection method based on branch entropy and segmentation probability. The method first generates a candidate word set from the text based on branch entropy, then calculates the segmentation probability of each candidate to filter out noisy words. Two different models are proposed for the two cases of whether or not an annotated corpus related to the target text is available. In the absence of a related segmented corpus, a multi-criteria Transformer-CRF model is trained on general segmentation benchmark data sets. When a field-specific segmented corpus is available, a key-value memory neural network is introduced to fully exploit wordhood information. Experimental results show that the multi-criteria Transformer-CRF model achieves a MAP of 54.00% on legal texts over the top 900 returned words, which is 2.15% higher than that of the unsupervised method. With a segmented legal corpus, the key-value memory neural network further exceeds the former model, with an improvement of 3.43%.
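The candidate-generation step described above scores a string by the entropy of the characters adjacent to it: a true word can be preceded and followed by many different characters, while a word fragment has nearly deterministic neighbors. The sketch below is a toy illustration of this branch-entropy statistic over a raw string, not the authors' implementation; the function name and the corpus-scanning strategy are assumptions.

```python
import math
from collections import defaultdict

def branch_entropy(corpus, candidate):
    """Minimum of left and right neighbor entropy for a candidate string.

    High entropy on BOTH sides suggests a free-standing word; low entropy
    on either side suggests the candidate is a fragment of a longer word.
    """
    left, right = defaultdict(int), defaultdict(int)
    start = corpus.find(candidate)
    while start != -1:
        if start > 0:                       # character to the left
            left[corpus[start - 1]] += 1
        end = start + len(candidate)
        if end < len(corpus):               # character to the right
            right[corpus[end]] += 1
        start = corpus.find(candidate, start + 1)

    def entropy(counts):
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log(c / total) for c in counts.values())

    # A word must be "free" on both sides, so take the weaker side.
    return min(entropy(left), entropy(right))
```

For example, in the string `"xayazawa"` the candidate `"a"` has four distinct left neighbors and three distinct right neighbors, so its score is ln 3, whereas `"ab"` in `"cabcabcab"` always neighbors `c` and scores 0.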

Key words: New word detection, Branch entropy, Mutual information, Transformer, Conditional random fields, Key-value memory neural networks
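Mutual information, listed among the keywords, is the other classic unsupervised wordhood statistic: pointwise mutual information (PMI) measures how much more often two adjacent strings co-occur than chance would predict. A minimal sketch over a raw string corpus, with a hypothetical function name and naive frequency estimates:

```python
import math

def pmi(corpus, left, right):
    """Pointwise mutual information of two adjacent substrings.

    High PMI means the concatenation occurs far more often than the
    independent frequencies of its parts would predict, suggesting it
    behaves as a single word.
    """
    n = len(corpus)
    pair = left + right
    # Naive relative-frequency estimates over all window positions.
    p_xy = corpus.count(pair) / max(n - len(pair) + 1, 1)
    p_x = corpus.count(left) / max(n - len(left) + 1, 1)
    p_y = corpus.count(right) / max(n - len(right) + 1, 1)
    if p_xy == 0:
        return float("-inf")
    return math.log(p_xy / (p_x * p_y))
```

In practice such scores are combined with branch entropy to rank candidates before the segmentation-probability filtering stage.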

CLC Number: TP391