Computer Science ›› 2015, Vol. 42 ›› Issue (2): 217-223. doi: 10.11896/j.issn.1002-137X.2015.02.045


Key Retrieval Technologies in Large-scale Chinese Corpus

YU Yi-jiao and LIU Qin   

  • Online:2018-11-14 Published:2018-11-14

Abstract: The query requirements of large-scale Chinese corpora differ from those of general text retrieval systems. Cici v2.0 is a Chinese corpus search system that provides linguistic query services: part-of-speech search, reduplicated-word search, wildcard search, and Chinese N-gram string occurrence search. N-gram string occurrences are counted and indexed by Unicode and by frequency, respectively. The search procedure has three steps. First, the Chinese N-gram occurrence statistics are searched to produce candidate N-gram strings. Second, keywords are searched according to the user's linguistic needs. Finally, the Chinese strings selected by the user are searched in the corpora and the final results are returned.

Key words: Chinese character, Corpus, Information retrieval, Part-of-speech, N-gram
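
The three-step procedure in the abstract can be illustrated with a minimal Python sketch. It is not the Cici v2.0 implementation. The function names, the `?` single-character wildcard syntax, and the in-memory `Counter` index are all assumptions made for illustration, and a real system at corpus scale would need on-disk indexes. In this sketch N-grams are taken over raw characters, so they may span punctuation.

```python
import re
from collections import Counter


def count_ngrams(texts, n_max=2):
    """Count every character N-gram (1 <= N <= n_max) across the texts."""
    counts = Counter()
    for text in texts:
        for n in range(1, n_max + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def build_indexes(counts):
    """Return the N-gram strings ordered by code point and by frequency."""
    by_unicode = sorted(counts)  # Python compares strings by code point
    by_frequency = sorted(counts, key=lambda s: (-counts[s], s))
    return by_unicode, by_frequency


def wildcard_candidates(pattern, ngrams):
    """Step 1 and 2: select candidate N-grams matching a wildcard pattern.

    '?' stands for exactly one character (a hypothetical syntax).
    """
    regex = re.compile(
        "".join("." if ch == "?" else re.escape(ch) for ch in pattern)
    )
    return [s for s in ngrams if regex.fullmatch(s)]


def find_in_corpus(selected, texts):
    """Step 3: locate the user-selected strings as (text index, offset)."""
    hits = []
    for doc_id, text in enumerate(texts):
        for s in selected:
            for m in re.finditer(re.escape(s), text):
                hits.append((doc_id, m.start()))
    return hits


texts = ["高高兴兴", "高兴"]
counts = count_ngrams(texts, n_max=2)
by_unicode, by_frequency = build_indexes(counts)
candidates = wildcard_candidates("高?", by_unicode)
hits = find_in_corpus(["高兴"], texts)
```

Searching the N-gram statistics first keeps the wildcard match on a compact list of distinct strings. Only the strings the user finally selects are looked up in the full corpus text.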

