Computer Science, 2015, Vol. 42, Issue (7): 265-269. doi: 10.11896/j.issn.1002-137X.2015.07.057

• Artificial Intelligence •

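Positional Language Model-based Chinese IR System

CHEN Ya-lan, HU Xiao-hua, TU Xin-hui and HE Ting-ting

  1. School of Computer Science, Central China Normal University, Wuhan 430079 (all authors); College of Information Science and Technology, Drexel University, Philadelphia 19082 (HU Xiao-hua, additional affiliation)
  • Online: 2018-11-14 Published: 2018-11-14
  • Supported by:
    This work was supported by the Major Program of the National Social Science Foundation of China (12&2D223), the Key Project of the Natural Science Foundation of Hubei Province (2011CDA034), the Key Project of the State Language Commission under the 12th Five-Year Plan (ZDI125-1), the National Key Technology R&D Program of the 12th Five-Year Plan (2012BAK24B01), the Program of Introducing Talents of Discipline to Universities of the Ministry of Education and the State Administration of Foreign Experts Affairs (B07042), the Fundamental Research Funds for the Central Universities at Central China Normal University (CCNU13A05014, CCNU13C01001, CCNU13F010), and the National Natural Science Foundation of China (61300144).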

Abstract: Most existing retrieval models overlook two sources of evidence that can improve document scoring: the proximity of matched query terms within a document, and passage-level retrieval. Motivated by this, a Chinese information retrieval system based on the positional language model is proposed. First, the concept of propagated count is defined and used to estimate a separate language model for each position in a document. Next, the KL-divergence retrieval model is combined with the positional language model to score each position individually. Finally, a multi-parameter scoring strategy yields the final document score. The experiments also compare the retrieval effectiveness of two Chinese indexing approaches, dictionary-based (word) indexing and bigram-based indexing, under the positional language model. Results on the standard NTCIR5 and NTCIR6 test collections show that the method significantly improves Chinese retrieval performance under both indexing approaches and outperforms the vector space model, the Okapi BM25 probabilistic model, and the statistical language model.

Key words: Positional language model, Proximity, Passage retrieval, Propagated count
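
The pipeline summarized in the abstract (bigram indexing, propagated counts, a language model per position, KL-divergence scoring) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the Gaussian kernel and its width `sigma`, the Dirichlet-style smoothing weight `mu`, the toy collection model, the best-position aggregation in `document_score` (used here in place of the paper's multi-parameter strategy, which the abstract does not specify), and all function names.

```python
import math
from collections import Counter


def bigrams(text):
    """Overlapping character bigrams of an unsegmented Chinese string."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def gaussian_kernel(i, j, sigma):
    """Assumed kernel: how much a term at position j counts at position i."""
    return math.exp(-((i - j) ** 2) / (2.0 * sigma ** 2))


def propagated_counts(tokens, i, sigma):
    """Propagated count of every term at position i (sum of kernel weights)."""
    counts = Counter()
    for j, w in enumerate(tokens):
        counts[w] += gaussian_kernel(i, j, sigma)
    return counts


def position_score(query, tokens, i, collection_prob, sigma=2.0, mu=1.0):
    """Negative KL divergence between the query model and the smoothed
    language model estimated at position i (sum restricted to query terms)."""
    counts = propagated_counts(tokens, i, sigma)
    total = sum(counts.values())
    q_counts = Counter(query)
    score = 0.0
    for w, c in q_counts.items():
        p_q = c / len(query)
        # Assumed Dirichlet-style smoothing with the collection model.
        p_d = (counts[w] + mu * collection_prob(w)) / (total + mu)
        score -= p_q * math.log(p_q / p_d)
    return score


def document_score(query, tokens, collection_prob, sigma=2.0, mu=1.0):
    """Assumed aggregation: score the document by its best position."""
    return max(
        position_score(query, tokens, i, collection_prob, sigma, mu)
        for i in range(len(tokens))
    )


# Toy data (hypothetical): the query terms are adjacent in doc_a
# and far apart in doc_b.
doc_a = bigrams("中文信息检索系统")
doc_b = bigrams("信息系统与文献检索")
query = bigrams("信息检索")

cf = Counter(doc_a + doc_b)
n_tokens = sum(cf.values())


def collection_prob(w):
    """Toy collection model with a small floor so unseen terms stay > 0."""
    return (cf[w] + 0.5) / (n_tokens + 0.5 * (len(cf) + 1))


score_a = document_score(query, doc_a, collection_prob)
score_b = document_score(query, doc_b, collection_prob)
print("doc_a:", score_a)
print("doc_b:", score_b)
```

In this toy setup the document whose query bigrams occur close together receives the higher score, which is the proximity effect the abstract describes. A dictionary-based variant would only replace `bigrams` with a word segmenter; the scoring functions would stay the same.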

