Computer Science (计算机科学), 2023, Vol. 50, Issue (11A): 221200070-6. doi: 10.11896/jsjkx.221200070
王裴岩1, 张莹欣1, 付小强2, 陈佳欣1, 徐楠1, 蔡东风1
WANG Peiyan1, ZHANG Yingxin1, FU Xiaoqiang2, CHEN Jiaxin1, XU Nan1, CAI Dongfeng1
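Abstract: Chinese word segmentation is a basic task in processing process specification text, and it plays an important role in downstream tasks such as process knowledge graphs and intelligent question answering. One challenge in segmenting process specification text is the lack of high-quality annotated corpora, and in particular of segmentation guidelines covering special linguistic phenomena such as terms, noun phrases, process parameters, and numeral-quantifier expressions. This paper formulates a dedicated segmentation guideline for process specification text, and collects and annotates a Chinese process specification text word segmentation corpus (WS-MPST) containing 11 900 sentences and 255 160 words; the segmentation annotation agreement among the 4 annotators reaches 95.25%. Comparative experiments with the well-known BiLSTM-CRF and BERT-CRF models on the WS-MPST corpus yield F1 scores of 92.61% and 93.69%, respectively. The experimental results show that building a dedicated segmentation corpus for process specifications is necessary. A deeper analysis of the results reveals that out-of-vocabulary words and words mixing Chinese and non-Chinese characters are the main difficulties in segmenting process specification text, and it also offers some guidance for future segmentation research on process specification text and related domains.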
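For readers unfamiliar with how a word segmentation F1 score is usually obtained, the following is a minimal sketch of the standard span-based computation: a predicted word counts as correct only if both of its boundaries match a gold word. The abstract does not describe the authors' evaluation script, so the function names (`to_spans`, `segmentation_prf`) and the example sentence are hypothetical illustrations, not material from WS-MPST.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def segmentation_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Span-based precision, recall and F1 for one segmented sentence."""
    if "".join(gold) != "".join(pred):
        raise ValueError("gold and pred must cover the same characters")
    g, p = to_spans(gold), to_spans(pred)
    correct = len(g & p)  # words whose two boundaries both match
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical sentence (an assumption for illustration, not from the corpus):
# the prediction wrongly splits the mixed Chinese/non-Chinese token "M8".
gold = ["拧紧", "M8", "螺栓", ",", "力矩", "25", "N·m"]
pred = ["拧紧", "M", "8", "螺栓", ",", "力矩", "25", "N·m"]
precision, recall, f1 = segmentation_prf(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

In this toy case a single wrong split of a mixed-character token costs one gold word and adds two incorrect predicted words, which is the kind of error the abstract identifies as a difficulty. Corpus-level scores are normally computed by summing the correct, gold, and predicted counts over all sentences before dividing, rather than averaging per-sentence values.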