Computer Science, 2018, Vol. 45, Issue (1): 167-172. doi: 10.11896/j.issn.1002-137X.2018.01.029

• The 16th China Conference on Machine Learning •

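Vietnamese Combinational Ambiguity Disambiguation Based on Weighted Voting Method of Multiple Classifiers

LI Jia, GUO Jian-yi, LIU Yan-chao, YU Zheng-tao, XIAN Yan-tuan and NGUYỄN Qing'e

  • Affiliations (listed in author order):
    1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500
    2. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500; Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500
    3. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500
    4. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500; Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500
    5. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500; Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500
    6. International College, Kunming University of Science and Technology, Kunming 650093
  • Online: 2018-01-15 Published: 2018-11-13
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61262041, 61562052, 61472168) and the Key Project of the Natural Science Foundation of Yunnan Province (2013FA030).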

Abstract: Combinational ambiguity resolution is one of the key problems in word segmentation and directly affects segmentation accuracy. To reduce the impact of combinational ambiguity on Vietnamese word segmentation, this paper proposes an ensemble-learning method for resolving Vietnamese combinational ambiguity that takes the characteristics of Vietnamese combinational words into account. First, Vietnamese combinational ambiguous words are selected manually to build a library of Vietnamese combinational ambiguity fields, and the Vietnamese corpus is matched against a Vietnamese combinational-word dictionary to extract the ambiguous fields. Second, three types of classifiers are trained with Vietnamese word-frequency features and context information, yielding three disambiguation models and their individual results. Finally, a weight is computed for each classifier, and the final classification of each combinational ambiguity is determined by a threshold. Experiments show that the proposed method reaches an accuracy of 83.32%, which is 5.81% higher than that of the best-performing single classifier.

Key words: Combinational-word dictionary, Combinational ambiguity resolution, Vietnamese, Ensemble learning, Weighted voting method
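
The final step described in the abstract, weighting each classifier and applying a threshold, can be illustrated with a minimal Python sketch. The abstract does not state how the weights are computed or what threshold is used, so everything specific here is an assumption for illustration only: the weights are taken to be proportional to each classifier's held-out accuracy, the threshold is set to 0.5, and the names `clf_a`, `clf_b`, `clf_c`, `accuracy_weights` and `weighted_vote` are hypothetical. Label 1 stands for "keep the field as one combined word" and 0 for "split it".

```python
from typing import Dict


def accuracy_weights(accuracies: Dict[str, float]) -> Dict[str, float]:
    """Normalize per-classifier accuracies so the weights sum to 1.

    Assumed weighting scheme; the paper's actual formula is not given
    in the abstract.
    """
    total = sum(accuracies.values())
    if total <= 0:
        raise ValueError("accuracies must sum to a positive value")
    return {name: acc / total for name, acc in accuracies.items()}


def weighted_vote(votes: Dict[str, int],
                  weights: Dict[str, float],
                  threshold: float = 0.5) -> int:
    """Combine binary votes (0 or 1) into one decision.

    Returns 1 if the weighted share of classifiers voting 1 reaches
    the threshold, otherwise 0.
    """
    score = sum(weights[name] * vote for name, vote in votes.items())
    return 1 if score >= threshold else 0


# Hypothetical held-out accuracies for three classifiers.
weights = accuracy_weights({"clf_a": 0.70, "clf_b": 0.75, "clf_c": 0.77})

# Two of the three classifiers vote "combined" (1) for an ambiguous field.
decision = weighted_vote({"clf_a": 1, "clf_b": 0, "clf_c": 1}, weights)
print(decision)
```

With accuracy-proportional weights the three classifiers carry nearly equal weight, so this sketch behaves much like majority voting. A scheme that rewards stronger classifiers more sharply, or a threshold tuned on development data, would shift the decision boundary.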

