Computer Science ›› 2016, Vol. 43 ›› Issue (2): 51-56. doi: 10.11896/j.issn.1002-137X.2016.02.011

• 2015 China Computer Federation Conference on Artificial Intelligence •

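Recognition of Prosodic Phrases Based on Unlabeled Corpus and “Adhesion” Culling Strategy

QIAN Yi-li and CAI Ying-ying

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006
  • Online: 2018-12-01 Published: 2018-12-01
  • Supported by:
    The National Natural Science Foundation of China Youth Program (61005053, 61100138) and the Shanxi Province Youth Science and Technology Research Fund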

Abstract: Building a large-scale corpus by manually annotating prosodic structure is difficult and problematic. Exploiting the fact that punctuation marks indicate pauses, this paper proposes a prosodic phrase recognition method that uses an unlabeled corpus and a word “adhesion” culling strategy. Punctuation marks are graded into levels and assigned different weights when they are used to simulate prosodic boundaries. A maximum entropy model is built from the unlabeled corpus, and a Top-K method is used to predict the prosodic phrase boundaries of a sentence automatically. The mutual information between the parts of speech of adjacent grammatical words is then computed to bind words into “adhesion” units, and any predicted prosodic boundary that falls inside such a unit is removed, which yields the recognized prosodic phrases. Experimental results show that grading the punctuation marks when building the unlabeled corpus and applying the “adhesion” culling strategy both improve model performance significantly, and that the method achieves good recognition results.

Key words: Unlabeled corpus, Prosodic phrase boundary, Maximum entropy (ME), Mutual information
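
The “adhesion” culling step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula or threshold, so the code below assumes pointwise mutual information over adjacent part-of-speech tags and a fixed cutoff. The function names `pos_pmi` and `cull_boundaries`, the `threshold` parameter, and the toy tag set are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def pos_pmi(tagged_sentences):
    """Estimate pointwise mutual information for adjacent POS-tag pairs.

    Assumption: PMI(a, b) = log2(P(a, b) / (P(a) * P(b))), with probabilities
    taken as relative frequencies; the paper's own estimator may differ.
    """
    unigrams, bigrams = Counter(), Counter()
    for tags in tagged_sentences:
        unigrams.update(tags)
        bigrams.update(zip(tags, tags[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def cull_boundaries(tags, boundaries, pmi, threshold):
    """Remove predicted boundaries that fall inside an "adhesion" unit.

    `boundaries` holds indices i meaning "break after word i". Two adjacent
    words are treated as adhered when the PMI of their tags reaches the
    (hypothetical) threshold; unseen tag pairs never adhere.
    """
    kept = []
    for i in boundaries:
        inside_sentence = i + 1 < len(tags)
        if inside_sentence:
            score = pmi.get((tags[i], tags[i + 1]), float("-inf"))
            if score >= threshold:
                continue  # boundary splits an adhesion unit: cull it
        kept.append(i)
    return kept


# Toy POS sequences standing in for a tagged corpus (illustrative only).
corpus = [["a", "n", "v", "n"], ["a", "n", "v"], ["d", "v", "n"]]
pmi_table = pos_pmi(corpus)
result = cull_boundaries(["a", "n", "v", "n"], [0, 1, 3], pmi_table, threshold=1.5)
```

In this toy run the boundary after the first word is dropped because the adjective-noun pair scores above the cutoff, while the other two boundaries survive. In the described method the input boundaries would come from the maximum entropy model with Top-K selection rather than being supplied by hand.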

