Computer Science ›› 2017, Vol. 44 ›› Issue (10): 259-264. doi: 10.11896/j.issn.1002-137X.2017.10.047

• Artificial Intelligence •

Effect of Preprocessing on Corpus of Mongolian-Chinese Statistical Machine Translation

LI Jin-ting (李金廷), HOU Hong-xu (侯宏旭), WU Jing (武静), WANG Hong-bin (王洪彬) and FAN Wen-ting (樊文婷)

  1. College of Computer Science, Inner Mongolia University, Hohhot 010021
  • Online: 2018-12-01 Published: 2018-12-01
  • Supported by:
    This work was supported by the National Natural Science Foundation of China project "Research on Key Technologies for Information Retrieval across Chinese and Slavic (Cyrillic) Mongolian" (61362028) and the Inner Mongolia Autonomous Region Graduate Research Innovation Project "Research on Key Technologies of Mongolian-Chinese Corpus Preprocessing" (11200-12110201).

Abstract: Traditional Mongolian morphological analysis mainly splits Mongolian suffixes from stems and keeps only the stem, which discards the large amount of semantic information carried by the suffixes. Mongolian suffixes include many case suffixes, which mainly represent the structural features of the sentence rather than the semantics of the word. Segmenting them therefore does not affect the semantic features of the vocabulary, whereas leaving them unprocessed causes severe data sparsity and degrades translation quality. Based on existing theory, we summarize and study corpus preprocessing methods, focusing on the effect of Mongolian case processing on translation results, with the aim of improving Mongolian-Chinese statistical machine translation (SMT) by addressing the particularities of Mongolian morphological analysis. With the optimized preprocessing method, the BLEU score of the translation output improved by 3.22 points over baseline system 1.

Key words: statistical machine translation, corpus preprocessing, Mongolian morphological analysis, case processing, Latin transliteration, Chinese word segmentation
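
The contrast drawn in the abstract between stem-only segmentation and case processing can be illustrated with a minimal sketch. It is not the authors' implementation: the `-` marker between stem and case suffix, the placeholder tokens (`stemA-GEN` and so on), and all function names are assumptions made up for illustration, and real Latin-transliterated corpora may mark suffixes differently. The sketch only shows why splitting the case suffix into its own token shrinks the vocabulary (less data sparsity) without throwing the suffix away.

```python
from collections import Counter

# Hypothetical marker separating a stem from its case suffix in
# Latin-transliterated text; real corpora may use another convention.
SUFFIX_MARK = "-"


def keep_whole(token):
    """No preprocessing: every inflected form is its own vocabulary item."""
    return [token]


def stem_only(token):
    """Stem-only segmentation: the suffix and its information are dropped."""
    return [token.split(SUFFIX_MARK, 1)[0]]


def split_case(token):
    """Case processing: stem and case suffix become two separate tokens."""
    stem, sep, suffix = token.partition(SUFFIX_MARK)
    return [stem, SUFFIX_MARK + suffix] if sep else [stem]


def preprocess(sentence, strategy):
    """Apply one strategy to every whitespace-separated token."""
    out = []
    for tok in sentence.split():
        out.extend(strategy(tok))
    return out


def vocabulary(corpus, strategy):
    """Count token types produced by a strategy over the whole corpus."""
    counts = Counter()
    for sentence in corpus:
        counts.update(preprocess(sentence, strategy))
    return counts


# Placeholder tokens, not real Mongolian: three stems, three made-up case tags.
corpus = [
    "stemA-GEN stemB-ACC stemC-DAT",
    "stemA-DAT stemB-GEN stemC-ACC",
    "stemA-ACC stemB-DAT stemC-GEN",
]

for name, strategy in [("whole", keep_whole),
                       ("stem only", stem_only),
                       ("split case", split_case)]:
    print(name, len(vocabulary(corpus, strategy)))
```

On this toy corpus the unprocessed text has one vocabulary entry per inflected form, stem-only segmentation collapses everything to the bare stems, and splitting keeps both stems and case tags as a smaller shared inventory. Whether this translates into a BLEU gain depends on the actual segmenter and training setup, which the sketch does not model.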

