Computer Science (计算机科学) ›› 2015, Vol. 42 ›› Issue (6): 228-232. doi: 10.11896/j.issn.1002-137X.2015.06.048

• Artificial Intelligence •

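Active Learning in Chinese Word Segmentation Based on Nearest Neighbor

LIANG Xi-tao and GU Lei

  1. College of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210003
  • Online: 2018-11-14 Published: 2018-11-14
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61302157), the Youth Fund for Humanities and Social Sciences Research of the Ministry of Education (12YJC870008), the Philosophy and Social Sciences Fund for Universities of the Jiangsu Provincial Department of Education (2013SJB870004), and the Jiangsu Province Social Science Research Cultural Excellence Project (12SWC-030).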

Abstract: Chinese word segmentation (CWS) is a key foundational technology in Chinese natural language processing. To address the shortage of training samples and the time and labor needed to obtain large numbers of labeled samples, an active learning method for word segmentation based on the nearest neighbor rule is proposed. The method adopts CRFs as the basic framework and uses the proposed selection strategy to choose the most valuable samples for annotation from a large pool of unlabeled samples. The annotated samples are then added to the training set, which is used to train the segmenter. The method was tested on the PKU, MSR, and Shanxi University datasets and compared with the traditional uncertainty-based selection strategy. The experimental results show that the proposed nearest-neighbor active learning method selects more valuable samples, effectively reduces the cost of manual annotation, and improves segmentation accuracy.

Key words: Chinese word segmentation, Active learning, Uncertainty sampling, Nearest neighbor rule
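
The select, annotate, retrain cycle described in the abstract can be illustrated with a minimal pool-based active learning loop. This is only a sketch under stated assumptions, not the authors' implementation: the selection step shown is the least-confidence uncertainty baseline that the paper compares against, because the abstract does not spell out the nearest-neighbor rule; `toy_train` is a hypothetical stand-in for CRF training; and all function names and the sample sentences are invented for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Labeled = Tuple[str, List[str]]   # (sentence, gold segmentation)
Model = Callable[[str], float]    # sentence -> confidence in [0, 1]


def least_confidence(confidence: float) -> float:
    """Uncertainty score: higher means the model is less sure."""
    return 1.0 - confidence


def select_batch(pool: Sequence[str], model: Model, batch_size: int) -> List[str]:
    """Pick the batch_size pool sentences the model is least confident about.

    Assumption: this is the uncertainty baseline. The paper's
    nearest-neighbor strategy would replace this ranking step.
    """
    ranked = sorted(pool, key=lambda s: least_confidence(model(s)), reverse=True)
    return ranked[:batch_size]


def active_learning(
    labeled: Sequence[Labeled],
    pool: Sequence[str],
    train: Callable[[List[Labeled]], Model],
    annotate: Callable[[str], List[str]],
    batch_size: int,
    rounds: int,
) -> Tuple[Model, List[Labeled], List[str]]:
    """Repeat: select from the pool, annotate, add to training set, retrain."""
    labeled = list(labeled)
    pool = list(pool)
    model = train(labeled)
    for _ in range(rounds):
        if not pool:
            break
        for sentence in select_batch(pool, model, batch_size):
            pool.remove(sentence)
            labeled.append((sentence, annotate(sentence)))  # human annotation step
        model = train(labeled)  # retrain the segmenter on the enlarged set
    return model, labeled, pool


def toy_train(labeled: List[Labeled]) -> Model:
    """Hypothetical stand-in for CRF training.

    Confidence is the fraction of characters already seen in training data.
    """
    seen = {ch for sentence, _ in labeled for ch in sentence}

    def confidence(sentence: str) -> float:
        if not sentence:
            return 1.0
        return sum(ch in seen for ch in sentence) / len(sentence)

    return confidence


# Usage: the oracle is a lookup table standing in for a human annotator.
gold = {"我们好": ["我们", "好"], "天气": ["天气"], "我们的": ["我们", "的"]}
model, labeled_set, remaining = active_learning(
    labeled=[("我们", ["我们"])],
    pool=list(gold),
    train=toy_train,
    annotate=gold.__getitem__,
    batch_size=1,
    rounds=1,
)
print(labeled_set[-1], remaining)
```

Keeping the selection step in its own function means a different strategy can be swapped in without touching the loop, which is the part of the pipeline the paper's contribution targets.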

