Word Segmentation of Petroleum-Domain Documents Based on an Adaptive Hidden Markov Model

Computer Science, 2018, Vol. 45, Issue (6A): 97-100.

GONG Fa-ming (宫法明), ZHU Peng-hai (朱朋海)

College of Computer and Communication Engineering, China University of Petroleum (East China), Qingdao, Shandong 266580, China

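Publication date: 2018-06-20  Release date: 2018-08-03
About the authors: GONG Fa-ming (born 1970), male, master's degree, associate professor; main research interests are computer image processing, intelligent big data processing, and cloud computing; E-mail: z15070507@s.upc.edu.cn. ZHU Peng-hai (born 1992), male, master's student; main research interests are computer image processing and natural language processing.
Funding:
Supported by the Innovation Method Work program of the Ministry of Science and Technology: Research and Application Demonstration of Innovative Methods for Oil and Gas Exploitation in a Big Data Environment (2015IM010300)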

Word Segmentation Based on Adaptive Hidden Markov Model in Oilfield

GONG Fa-ming,ZHU Peng-hai

College of Computer & Communication Engineering,China University of Petroleum,Qingdao,Shandong 266580,China

Online:2018-06-20 Published:2018-08-03

Abstract

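Abstract: Chinese word segmentation is the process of converting a string of Chinese characters that carries no delimiters into a sequence of words consistent with actual language use, and it is the first step in building a petroleum-domain ontology. Petroleum-domain documents have distinctive characteristics that make segmentation harder, and no effective segmentation algorithm for them exists so far. By introducing a terminology set on top of the hidden Markov segmentation model, this paper proposes a segmentation algorithm based on an adaptive hidden Markov model. The algorithm builds on the adaptive hidden Markov model, combines a domain dictionary with mutual information, and calibrates the segmentation using semantic constraints and word-sense constraints, so that petroleum-domain technical terms and compound words are identified accurately. A comparison with the NLPIR Chinese word segmentation system of the Chinese Academy of Sciences shows that the proposed algorithm significantly improves accuracy and recall.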
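The abstract names mutual information as one of the signals used to recognize compound terms but does not give the formula. A common choice is pointwise mutual information (PMI) between adjacent tokens, where a high score suggests the two tokens should be merged into one term. The sketch below is a minimal illustration under that assumption; the function name `pmi_scores` and the toy token list are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Return log2 pointwise mutual information for each adjacent token pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi          # probability of the adjacent pair
        p_a = unigrams[a] / n_uni    # probability of the left token
        p_b = unigrams[b] / n_uni    # probability of the right token
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Toy, made-up token stream (already roughly segmented); not data from the paper.
tokens = "钻井 液 的 密度 的 变化 钻井 液 的 粘度 的 变化".split()
scores = pmi_scores(tokens)

# Pairs seen at least twice, strongest association first.
pair_counts = Counter(zip(tokens, tokens[1:]))
frequent = sorted(
    (pair for pair in scores if pair_counts[pair] >= 2),
    key=scores.get,
    reverse=True,
)
print(frequent[0])
```

In this toy stream the pair (钻井, 液) scores higher than (液, 的), because 的 co-occurs with many different neighbors. That is the behavior one would want when deciding that 钻井液 ("drilling fluid") is a single domain term.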

Keywords: petroleum, hidden Markov model, Chinese word segmentation, compound words

Abstract: Chinese word segmentation is the first step in constructing a petroleum-domain ontology. Documents in the petroleum field have unique characteristics that make word segmentation more difficult, and so far there is no effective segmentation algorithm for them. By introducing a terminology set, this paper proposes an adaptive hidden Markov word segmentation model, built on the hidden Markov model, that combines a domain dictionary with mutual information. The proposed algorithm calibrates the segmentation under semantic constraints and word-meaning constraints, and accurately identifies technical terms and compound words in the petroleum field. A comparison with the NLPIR Chinese word segmentation system developed by the Chinese Academy of Sciences shows that the proposed algorithm achieves marked improvements in both accuracy and recall.
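The abstract does not spell out the underlying hidden Markov segmentation model. A standard formulation, assumed here, tags every character as B (begin), M (middle), E (end), or S (single-character word) and decodes the most likely tag sequence with the Viterbi algorithm. The sketch below shows only that plain baseline, not the adaptive variant or the dictionary and constraint steps the paper proposes; all probabilities are hand-set toy values, and the names `viterbi` and `tags_to_words` are hypothetical.

```python
import math

STATES = "BMES"  # Begin, Middle, End of a multi-character word; Single-character word


def viterbi(chars, start_p, trans_p, emit_p, floor=1e-8):
    """Most likely B/M/E/S tag sequence for chars under an HMM (log-space Viterbi)."""
    if not chars:
        return []

    def lg(prob):
        # Missing or zero probabilities are floored so the log stays finite.
        return math.log(prob if prob > 0 else floor)

    # best[t][s] = (log-probability of the best path ending in s at t, backpointer)
    best = [{s: (lg(start_p.get(s, 0)) + lg(emit_p[s].get(chars[0], 0)), None)
             for s in STATES}]
    for ch in chars[1:]:
        row = {}
        for s in STATES:
            prev, score = max(
                ((q, best[-1][q][0] + lg(trans_p[q].get(s, 0))) for q in STATES),
                key=lambda item: item[1],
            )
            row[s] = (score + lg(emit_p[s].get(ch, 0)), prev)
        best.append(row)

    # A sentence cannot end inside a word, so the last tag must be E or S.
    state = max("ES", key=lambda s: best[-1][s][0])
    tags = [state]
    for t in range(len(chars) - 1, 0, -1):
        state = best[t][state][1]
        tags.append(state)
    return tags[::-1]


def tags_to_words(chars, tags):
    """Cut the character string after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in "ES":
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# Hand-set toy parameters for illustration only; a real model estimates them from a corpus.
start_p = {"B": 0.6, "S": 0.4}
trans_p = {
    "B": {"M": 0.3, "E": 0.7},
    "M": {"M": 0.2, "E": 0.8},
    "E": {"B": 0.6, "S": 0.4},
    "S": {"B": 0.6, "S": 0.4},
}
emit_p = {
    "B": {"钻": 0.5, "密": 0.5},
    "M": {"井": 1.0},
    "E": {"液": 0.5, "度": 0.5},
    "S": {"高": 1.0},
}

text = "钻井液密度高"
tags = viterbi(text, start_p, trans_p, emit_p)
print(tags_to_words(text, tags))  # ['钻井液', '密度', '高']
```

With these toy emissions each character is plausible under exactly one tag, so the decoded segmentation is forced. On real petroleum text the emission and transition tables would be estimated from annotated data, which is where a domain dictionary and a terminology set would plausibly come in.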

Key words: Chinese word segmentation, Compound word, Hidden Markov model, Petroleum

CLC number: TP391

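GONG Fa-ming, ZHU Peng-hai. Word Segmentation of Petroleum-Domain Documents Based on an Adaptive Hidden Markov Model [J]. Computer Science (计算机科学), 2018, 45(6A): 97-100. https://doi.org/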

GONG Fa-ming,ZHU Peng-hai. Word Segmentation Based on Adaptive Hidden Markov Model in Oilfield[J]. Computer Science, 2018, 45(6A): 97-100. https://doi.org/
