Word Similarity Measurement Based on BaiduBaike

Computer Science ›› 2013, Vol. 40 ›› Issue (6): 199-202.

ZHAN Zhi-jian, LIANG Li-na and YANG Xiao-ping

School of Information, Renmin University of China, Beijing 100872

Online: 2018-11-16 Published: 2018-11-16
Funding:
Supported by the National Natural Science Foundation of China (70871115)

Abstract

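Abstract (from the Chinese): Word similarity computation is one of the key technologies of natural language processing and a widely studied basic topic. Traditional word similarity measures are mostly based either on semantic knowledge or on corpus statistics; the former requires a hierarchically organized semantic dictionary and the latter a large-scale corpus. This paper proposes a new word similarity measure based on BaiduBaike. By analyzing the information in BaiduBaike entries, it assesses entry similarity across the explanatory content that characterizes each entry, defines a similarity formula between entries, and obtains the overall similarity from the similarities between the corresponding parts. Experimental results show that, compared with existing similarity measures, the proposed algorithm is more effective and reasonable.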

Keywords: word similarity, language network, BaiduBaike, vector space model

Abstract: Word similarity measurement has been widely studied, not only in natural language processing but also in other basic research. Traditional word similarity measures rely on semantic lexicons or large-scale corpora. We first discuss the background of the applications of word similarity measurement, such as information retrieval, information extraction, text classification, and example-based machine translation. We then summarize two strategies for word similarity measurement: one is based on an ontology or a semantic taxonomy, the other on large collocations of words in a corpus. BaiduBaike, an open online encyclopedia, can be used not only as a corpus but also as a knowledge resource with rich semantic information. Based on BaiduBaike, with its rich semantic information and category graph, we propose a new method to analyze and compute Chinese word similarity along four dimensions: the baike card, the content of the word, the open classification of the word, and the correlated words. We use a language network to choose the top key terms from the content of a word. Based on vector space model (VSM) theory, we calculate the similarity between the corresponding parts of two words. We present a new “multi-path searching” algorithm on the BaiduBaike category graph. A comprehensive similarity measure based on the four parts is proposed. Experimental results show that the method performs well.

Key words: word similarity, language network, BaiduBaike, VSM
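
The abstract describes comparing two entries part by part with a vector space model and then merging the part scores into one overall score. A minimal sketch of that idea follows, under several assumptions that are not taken from the paper: the part names (`card`, `content`, `classification`, `related`), the equal weights, the raw term-frequency vectors, and the function names are all hypothetical. The paper applies a "multi-path searching" algorithm to the category graph; the sketch simplifies by using cosine similarity for every part.

```python
import math
from collections import Counter


def cosine_similarity(terms_a, terms_b):
    """Cosine of the angle between two raw term-frequency vectors."""
    vec_a, vec_b = Counter(terms_a), Counter(terms_b)
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty part is treated as sharing nothing
    return dot / (norm_a * norm_b)


# Hypothetical part names and equal weights; the abstract specifies neither.
PART_WEIGHTS = {
    "card": 0.25,
    "content": 0.25,
    "classification": 0.25,
    "related": 0.25,
}


def entry_similarity(entry_a, entry_b, weights=PART_WEIGHTS):
    """Weighted sum of per-part cosine similarities between two entries.

    Each entry is a dict mapping a part name to a list of terms.
    """
    return sum(
        weight * cosine_similarity(entry_a.get(part, []), entry_b.get(part, []))
        for part, weight in weights.items()
    )


if __name__ == "__main__":
    # Toy data invented for illustration only.
    apple = {
        "card": ["fruit", "tree"],
        "content": ["fruit", "sweet", "red"],
        "classification": ["plant", "food"],
        "related": ["pear", "orchard"],
    }
    pear = {
        "card": ["fruit", "tree"],
        "content": ["fruit", "sweet", "green"],
        "classification": ["plant", "food"],
        "related": ["apple", "orchard"],
    }
    print(round(entry_similarity(apple, pear), 3))
```

With weights that sum to 1, the overall score stays in [0, 1] whenever every part score does, which makes scores comparable across word pairs.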

ZHAN Zhi-jian, LIANG Li-na and YANG Xiao-ping. Word Similarity Measurement Based on BaiduBaike[J]. Computer Science, 2013, 40(6): 199-202. https://doi.org/

