Computer Science ›› 2016, Vol. 43 ›› Issue (3): 54-56. doi: 10.11896/j.issn.1002-137X.2016.03.010

• The 15th China Conference on Machine Learning •

Thai Syllable Segmentation Based on Conditional Random Fields

ZHAO Shi-yu, XIAN Yan-tuan, GUO Jian-yi, YU Zheng-tao, HONG Xuan-gui and WANG Hong-bin

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500
  • Published: 2018-12-01 Online: 2018-12-01
  • Supported by:
    This work was supported by the National Natural Science Foundation of China project "Research on Internet-Oriented Acquisition and Alignment of Thai-Chinese Bilingual Corpora" (61363044), the National Natural Science Foundation of China project "Research on Chinese-Thai Cross-Language News Event Retrieval Methods" (61462054), and the Key Project of the Yunnan Provincial Department of Education "Research on Similarity Computation in Chinese-Thai Cross-Language News Event Retrieval" (2014Z021).

Abstract: The syllable is the basic unit of word formation and pronunciation in Thai, and Thai syllable segmentation is important for research on Thai lexical analysis, speech synthesis and speech recognition. Drawing on the structural characteristics of Thai syllables, this paper proposes a Thai syllable segmentation method based on conditional random fields (CRFs). The method defines features from Thai letter categories and letter positions, and uses a CRF to label the letters of a Thai sentence as a sequence, which yields the syllable segmentation. A Thai syllable segmentation corpus was annotated on the basis of the InterBEST 2009 Thai corpus. Experiments on this corpus show that the method makes effective use of letter category and letter position information, reaching a precision, recall and F-value of 99.115%, 99.284% and 99.199%, respectively.

Key words: Thai character features, Thai syllable, Syllable segmentation, Conditional random fields
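The letter-level sequence labeling and the evaluation measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the tag set, the letter categories or the feature templates, so everything here is an assumption made for illustration: a two-tag B/I scheme (B marks the first letter of a syllable), a coarse category function based on Unicode code points, a context window of one letter on each side, and span-level scoring. The function names are hypothetical, and the CRF training step itself is omitted.

```python
from typing import Dict, List, Set, Tuple


def char_category(ch: str) -> str:
    """Coarse, illustrative letter category (an assumption, not the paper's scheme)."""
    cp = ord(ch)
    if 0x0E01 <= cp <= 0x0E2E:   # Thai consonants
        return "CONS"
    if 0x0E40 <= cp <= 0x0E44:   # vowels written before the consonant
        return "LEAD_VOWEL"
    if 0x0E30 <= cp <= 0x0E39:   # other vowel signs
        return "VOWEL"
    if 0x0E48 <= cp <= 0x0E4B:   # tone marks
        return "TONE"
    return "OTHER"


def to_labels(syllables: List[str]) -> List[Tuple[str, str]]:
    """Turn syllable-segmented text into (letter, tag) pairs with assumed B/I tags."""
    pairs = []
    for syllable in syllables:
        for i, ch in enumerate(syllable):
            pairs.append((ch, "B" if i == 0 else "I"))
    return pairs


def features(chars: List[str], i: int) -> Dict[str, str]:
    """Category of the current letter plus its left and right neighbours."""
    return {
        "cat": char_category(chars[i]),
        "prev_cat": char_category(chars[i - 1]) if i > 0 else "BOS",
        "next_cat": char_category(chars[i + 1]) if i + 1 < len(chars) else "EOS",
    }


def spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Convert a B/I tag sequence into a set of (start, end) syllable spans."""
    result, start = set(), 0
    for i in range(1, len(tags) + 1):
        if i == len(tags) or tags[i] == "B":
            result.add((start, i))
            start = i
    return result


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Span-level precision, recall and F for two tag sequences over the same letters."""
    gold_spans, pred_spans = spans(gold), spans(pred)
    if not gold_spans or not pred_spans:
        return 0.0, 0.0, 0.0
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return precision, recall, f_measure(precision, recall)
```

For example, to_labels(["ภา", "ษา", "ไทย"]) tags the seven letters of ภาษาไทย as B I B I B I I, and features() would supply the per-letter input that a CRF toolkit consumes. As a consistency check on the reported figures, f_measure(99.115, 99.284) rounds to 99.199, which matches the F-value in the abstract if F is the harmonic mean of precision and recall.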

