Computer Science, 2018, Vol. 45, Issue (1): 167-172. doi: 10.11896/j.issn.1002-137X.2018.01.029

• The 16th China Conference on Machine Learning •

Vietnamese Combinational Ambiguity Disambiguation Based on Weighted Voting Method of Multiple Classifiers

LI Jia, GUO Jian-yi, LIU Yan-chao, YU Zheng-tao, XIAN Yan-tuan and NGUYỄN Qing’e

  1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500 (LI Jia, GUO Jian-yi, LIU Yan-chao, YU Zheng-tao, XIAN Yan-tuan)
  2. Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500 (GUO Jian-yi, YU Zheng-tao, XIAN Yan-tuan)
  3. International College, Kunming University of Science and Technology, Kunming 650093 (NGUYỄN Qing’e)
  • Publication date: 2018-01-15 Release date: 2018-11-13
  • Funding:
    This work was supported by the National Natural Science Foundation of China (61262041, 61562052, 61472168) and the Key Project of the Natural Science Foundation of Yunnan Province (2013FA030).

Abstract: Combinational ambiguity resolution is one of the key problems in word segmentation and directly affects segmentation accuracy. To address the effect of Vietnamese combinational ambiguity on word segmentation, and drawing on the characteristics of Vietnamese combinational words, this paper proposes a method for resolving Vietnamese combinational ambiguity based on ensemble learning. First, Vietnamese combinational ambiguous words are selected manually to build a library of Vietnamese combinational ambiguity fields, and the Vietnamese corpus is matched against a Vietnamese combinational-word dictionary to extract the combinational ambiguity fields. Second, three types of classifiers, each using Vietnamese word-frequency features and context information, are used to build three disambiguation models, yielding three sets of disambiguation results. Finally, a weight is computed for each classifier, and the final classification of each Vietnamese combinational ambiguity is decided by a threshold. Experiments show that the proposed method achieves an accuracy of 83.32%, which is 5.81% higher than that of the best-performing single classifier.

Key words: Combinational-word dictionary, Combinational ambiguity disambiguation, Vietnamese, Ensemble learning, Weighted voting method
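
The final step described in the abstract (weighting each classifier, then thresholding the combined vote) can be illustrated with a minimal sketch. The abstract does not say how the weights are computed or which threshold is used, so everything specific here is an assumption: weights are taken to be proportional to each classifier's accuracy on held-out data, the threshold defaults to 0.5, the labels are binary (1 = keep the field as one combined word, 0 = split it), and the classifier names and accuracy values are hypothetical placeholders rather than figures from the paper.

```python
from typing import Dict


def accuracy_weights(dev_accuracy: Dict[str, float]) -> Dict[str, float]:
    """Turn per-classifier accuracies into weights that sum to 1.

    Assumed weighting scheme: the abstract only says that a weight is
    computed for each classifier, not how.
    """
    total = sum(dev_accuracy.values())
    if total <= 0:
        raise ValueError("accuracies must sum to a positive value")
    return {name: acc / total for name, acc in dev_accuracy.items()}


def weighted_vote(
    votes: Dict[str, int],
    weights: Dict[str, float],
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> int:
    """Combine binary votes (1 = combined word, 0 = split) by weight.

    Returns 1 when the weighted share of votes for label 1 reaches the
    threshold, otherwise 0.
    """
    score = sum(weights[name] * vote for name, vote in votes.items())
    return 1 if score >= threshold else 0


# Hypothetical classifier names and illustrative accuracies.
weights = accuracy_weights({"clf_a": 0.70, "clf_b": 0.75, "clf_c": 0.80})

# Two of three classifiers vote "combined", so the weighted score is about 0.67.
label = weighted_vote({"clf_a": 1, "clf_b": 0, "clf_c": 1}, weights)
```

With normalized weights and a threshold of 0.5, this reduces to a weighted majority vote; raising the threshold would bias the ensemble toward splitting the field unless the stronger classifiers agree.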

