Computer Science, 2018, Vol. 45, Issue 6A: 97-100.
宫法明,朱朋海
GONG Fa-ming,ZHU Peng-hai
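Abstract: Chinese word segmentation converts a string of Chinese characters, which carries no delimiters, into a sequence of words that reflects how the language is actually used. It is the first step in building an ontology for the petroleum domain. Petroleum-domain documents have distinctive characteristics that make segmentation harder, and no effective segmentation algorithm for them exists yet. By introducing a term set on top of the hidden Markov segmentation model, this paper proposes a segmentation algorithm based on an adaptive hidden Markov model. The algorithm builds on the adaptive hidden Markov model, combines it with a domain dictionary and mutual information, and calibrates the segmentation with semantic constraints and word-sense constraints, so that technical terms and compound words in the petroleum domain are identified accurately. A comparison with the NLPIR Chinese word segmentation system from the Chinese Academy of Sciences shows that the proposed algorithm significantly improves both precision and recall.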
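For readers unfamiliar with hidden Markov segmentation, the baseline that the abstract builds on can be sketched as follows. This is a minimal illustration, not the authors' adaptive model: it assumes the common B/M/E/S (begin, middle, end, single) tagging scheme, and the transition, start, and emission probabilities are hand-set toy values chosen for this example (in practice they would be estimated from a tagged corpus). All names (`TRANS`, `START`, `viterbi`, `segment`, `toy_emit`) are hypothetical. The domain dictionary, mutual information, and constraint-based calibration described in the abstract are not modeled here.

```python
import math

STATES = "BMES"  # Begin, Middle, End of a multi-character word; Single-character word
NEG_INF = float("-inf")

# Toy transition and start probabilities (assumed values, not from the paper).
# Missing entries are impossible transitions, e.g. B cannot be followed by B.
TRANS = {
    "B": {"M": 0.3, "E": 0.7},
    "M": {"M": 0.4, "E": 0.6},
    "E": {"B": 0.6, "S": 0.4},
    "S": {"B": 0.5, "S": 0.5},
}
START = {"B": 0.6, "S": 0.4}


def log(p):
    """Log-probability that maps zero to negative infinity."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi(text, emit, floor=1e-6):
    """Return the most likely BMES tag sequence for text.

    emit maps state -> {character: probability}; unseen characters
    receive the small probability `floor`.
    """
    if not text:
        return []
    score = [{s: log(START.get(s, 0)) + log(emit.get(s, {}).get(text[0], floor))
              for s in STATES}]
    back = []
    for ch in text[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for s in STATES:
            # Best predecessor state for reaching s at this position.
            best = max(STATES, key=lambda p: prev[p] + log(TRANS[p].get(s, 0)))
            cur[s] = (prev[best] + log(TRANS[best].get(s, 0))
                      + log(emit.get(s, {}).get(ch, floor)))
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # A complete segmentation must end on E or S.
    tags = [max("ES", key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        tags.append(ptr[tags[-1]])
    return tags[::-1]


def segment(text, emit):
    """Cut text into words after every character tagged E or S."""
    words, buf = [], ""
    for ch, tag in zip(text, viterbi(text, emit)):
        buf += ch
        if tag in "ES":
            words.append(buf)
            buf = ""
    return words


# Toy emission table (assumed): 石/勘 tend to start words, 油/探 tend to end them.
toy_emit = {
    "B": {"石": 0.5, "勘": 0.5},
    "E": {"油": 0.5, "探": 0.5},
}
print(segment("石油勘探", toy_emit))  # ['石油', '勘探']
```

Working in log space avoids numeric underflow on long inputs, and restricting the final tag to E or S guarantees that the last word is closed.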