Computer Science ›› 2017, Vol. 44 ›› Issue (10): 259-264. doi: 10.11896/j.issn.1002-137X.2017.10.047


Effect of Preprocessing on Corpus of Mongolian-Chinese Statistical Machine Translation

LI Jin-ting, HOU Hong-xu, WU Jing, WANG Hong-bin and FAN Wen-ting   

  • Online: 2018-12-01  Published: 2018-12-01

Abstract: Traditional morphological preprocessing for Mongolian relies on suffix segmentation and stemming, which leads to semantic loss in the words. The case suffix is a special component of the Mongolian word suffix: it represents only the syntactic information of the sentence, not the semantic information of the word. Inappropriate preprocessing of case suffixes causes data sparsity in machine translation training. We therefore survey the existing corpus preprocessing methods for Mongolian morphology and compare their results. Our methods focus mainly on the effect of case processing and improve the performance of the Mongolian-Chinese SMT system by 3.22 BLEU relative to the baseline system.

Key words: Statistical machine translation, Corpus preprocessing, Mongolian morphological analysis, Case processing, Latinization, Chinese word segmentation
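
The link between case handling and data sparsity described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that a case suffix is attached to its stem by the narrow no-break space (U+202F), and it uses placeholder tokens (`stemA`, `c1`, and so on) rather than real Mongolian words. The function names `process_case` and `vocabulary` and the three modes are hypothetical, chosen only for this illustration.

```python
from collections import Counter

# Assumption: a case suffix is joined to its stem by U+202F (narrow no-break space).
NNBSP = "\u202f"


def process_case(token: str, mode: str) -> list[str]:
    """Return the token(s) produced from one word under a case-handling mode.

    mode="keep"  : leave stem and case suffix fused as a single token
    mode="split" : emit the stem and the case suffix as two tokens
    mode="drop"  : emit the stem only and discard the case suffix
    """
    if NNBSP not in token:
        return [token]
    stem, suffix = token.split(NNBSP, 1)
    if mode == "keep":
        return [stem + suffix]
    if mode == "split":
        return [stem, "+" + suffix]  # "+" marks the token as a detached suffix
    if mode == "drop":
        return [stem]
    raise ValueError(f"unknown mode: {mode}")


def vocabulary(sentences: list[str], mode: str) -> Counter:
    """Count token types over a corpus after applying one case-handling mode."""
    counts: Counter = Counter()
    for sentence in sentences:
        # Split on the ASCII space only: str.split() with no argument would
        # also split on U+202F and lose the stem/suffix link.
        for token in sentence.split(" "):
            if token:
                counts.update(process_case(token, mode))
    return counts


# Placeholder corpus: two stems, each seen bare and with two case suffixes.
corpus = [
    f"stemA{NNBSP}c1 stemB",
    f"stemA{NNBSP}c2 stemB{NNBSP}c1",
    f"stemA stemB{NNBSP}c2",
]

for mode in ("keep", "split", "drop"):
    print(mode, len(vocabulary(corpus, mode)))
```

On this toy corpus, fusing stem and suffix yields more distinct token types than splitting or dropping the suffix, which is the sparsity effect the abstract refers to. Which mode gives the best translation quality is an empirical question that the sketch does not answer.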

