Minority Language Websites’ Automatic Identification and Collection

Computer Science, 2015, Vol. 42, Issue (Z6): 79-82.

LAN Yi-yong, LIU Hai-feng and YANG Yuan-yuan

College of Science, Minzu University of China, Beijing 100081; School of Information Engineering, Minzu University of China, Beijing 100081; Department of Minority Languages and Literatures, Minzu University of China, Beijing 100081

Online: 2018-11-14 Published: 2018-11-14
Funding:
This work was supported by the 2014 university-level independent research project of Minzu University of China (2014MDLXYZY04).

Abstract

Abstract: This paper analyzes the particular characteristics of websites written in Chinese minority languages and scripts, examines the problems of identifying such webpages, and proposes an identification method that combines special characters, webpage tag attributes, and N-grams. Based on this method, software was designed to automatically identify and collect websites in 10 minority languages and scripts: traditional Mongolian, Tibetan, the Arabic-script languages Uyghur, Kazak, and Kirgiz, as well as Yi, New Tai Lue, Korean, Russian, and Zhuang. The average correct identification rate across these 10 languages is above 95%, which is a satisfactory result.
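The abstract names special characters and N-grams as identification cues but does not give the algorithm. The following is a minimal sketch of how such cues could be combined, not the authors' implementation. It assumes that "special characters" can be approximated by Unicode block ranges, and it swaps in the well-known rank-order ("out-of-place") comparison of character N-gram profiles for the N-gram step, which would be needed for languages that share a script (Uyghur, Kazak, and Kirgiz in Arabic script, or Zhuang in Latin script). All function names, the `min_share` threshold, and the profile size are hypothetical.

```python
from collections import Counter

# Unicode block ranges for the scripts named in the abstract.
# Zhuang uses Latin letters, so a range check alone cannot identify it.
SCRIPT_RANGES = {
    "mongolian": [(0x1800, 0x18AF)],
    "tibetan": [(0x0F00, 0x0FFF)],
    "arabic": [(0x0600, 0x06FF)],      # Uyghur, Kazak, Kirgiz share this block
    "yi": [(0xA000, 0xA4CF)],          # Yi Syllables and Yi Radicals
    "new_tai_lue": [(0x1980, 0x19DF)],
    "hangul": [(0x1100, 0x11FF), (0xAC00, 0xD7AF)],
    "cyrillic": [(0x0400, 0x04FF)],
}


def script_of(ch):
    """Return the script name whose block contains ch, or None."""
    cp = ord(ch)
    for name, ranges in SCRIPT_RANGES.items():
        for lo, hi in ranges:
            if lo <= cp <= hi:
                return name
    return None


def dominant_script(text, min_share=0.5):
    """Return the script covering at least min_share of the letters, else None.

    min_share is an assumed threshold, not a value from the paper.
    """
    counts = Counter()
    letters = 0
    for ch in text:
        if not ch.isalpha():
            continue
        letters += 1
        name = script_of(ch)
        if name is not None:
            counts[name] += 1
    if not counts:
        return None
    name, n = counts.most_common(1)[0]
    return name if n / letters >= min_share else None


def ngram_profile(text, n=3, top=300):
    """Return the most frequent character n-grams, most frequent first."""
    padded = " " + " ".join(text.split()) + " "
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [g for g, _ in grams.most_common(top)]


def out_of_place(doc_profile, lang_profile):
    """Rank-order distance between two profiles; smaller means more similar."""
    rank = {g: i for i, g in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[g]) if g in rank else max_penalty
        for i, g in enumerate(doc_profile)
    )
```

In this sketch a page would first be passed to `dominant_script`; only when the result is ambiguous between languages (for example `"arabic"`) would its `ngram_profile` be compared against per-language reference profiles built from known text, choosing the language with the smallest `out_of_place` distance.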

Key words: Chinese minority language, Websites, Webpage, Automatic identification, Collection

LAN Yi-yong, LIU Hai-feng and YANG Yuan-yuan. Minority Language Websites’ Automatic Identification and Collection[J]. Computer Science, 2015, 42(Z6): 79-82. https://doi.org/

