Computer Science ›› 2015, Vol. 42 ›› Issue (Z6): 79-82.


Minority Language Websites’ Automatic Identification and Collection

LAN Yi-yong, LIU Hai-feng and YANG Yuan-yuan   

  • Online: 2018-11-14 Published: 2018-11-14

Abstract: This paper presents the characteristics of collecting Chinese minority-script content from websites, analyses the problems of identifying webpages written in Chinese minority scripts, and puts forward an identification method. Based on this method, we designed software to identify and collect text in Chinese minority languages and scripts, including Mongolian, Tibetan, Uyghur, Kazak, Kirgiz, Yi script, Tai Lue script, Korean, Russian, and Zhuang script, among others. The average correct identification rate is above 95%.

Key words: Chinese minority language, Websites, Webpage, Automatic identification, Collection
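
The abstract does not spell out the identification method, so the following is only an illustrative sketch of one common way to guess the script of a page, not the authors' algorithm. It assumes the page text has already been decoded to Unicode and simply counts characters per Unicode block. The names `SCRIPT_RANGES`, `script_of`, `dominant_script` and the `min_share` threshold are hypothetical. Pages in legacy, non-Unicode font encodings, and scripts that share a block (Uyghur, Kazak and Kirgiz in Arabic script; Zhuang in Latin script), would need additional language-level analysis.

```python
from collections import Counter

# Unicode block ranges for some scripts named in the abstract.
# Assumption: the page text is already decoded to Unicode.
SCRIPT_RANGES = {
    "Tibetan": [(0x0F00, 0x0FFF)],
    "Mongolian": [(0x1800, 0x18AF)],
    "Arabic": [(0x0600, 0x06FF)],   # shared by Uyghur, Kazak and Kirgiz text
    "Yi": [(0xA000, 0xA4CF)],       # Yi Syllables and Yi Radicals
    "New Tai Lue": [(0x1980, 0x19DF)],
    "Hangul": [(0x1100, 0x11FF), (0xAC00, 0xD7AF)],
    "Cyrillic": [(0x0400, 0x04FF)],
}


def script_of(ch):
    """Return the script name for one character, or None if it is not listed."""
    cp = ord(ch)
    for name, ranges in SCRIPT_RANGES.items():
        for lo, hi in ranges:
            if lo <= cp <= hi:
                return name
    return None


def dominant_script(text, min_share=0.5):
    """Return the most frequent listed script if it covers at least
    min_share of the non-whitespace characters, otherwise None."""
    counts = Counter()
    total = 0
    for ch in text:
        if ch.isspace():
            continue
        total += 1
        name = script_of(ch)
        if name is not None:
            counts[name] += 1
    if not counts:
        return None
    name, n = counts.most_common(1)[0]
    return name if n / total >= min_share else None
```

A collector built along these lines might call `dominant_script` on the visible text of each fetched page and keep only pages for which it returns a target script. The threshold is a tunable guess and is not a value reported in the paper.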

