Computer Science (计算机科学) ›› 2015, Vol. 42 ›› Issue (1): 261-267. doi: 10.11896/j.issn.1002-137X.2015.01.058

• Artificial Intelligence •

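Measuring Semantic Similarity between Words Using Web Search Engines

CHEN Hai-yan (陈海燕)

  1. Department of Computer Science and Technology, East China University of Political Science and Law, Shanghai 201620, China
  • Published: 2018-11-14 Online: 2018-11-14
  • Funding:
    This work was supported by the National Social Science Foundation of China (06BFX051) and the Shanghai Special Research Fund for the Selection and Training of Outstanding Young Teachers in Universities (hzf05046).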

Abstract: Semantic similarity measures play an important role in many Web-related tasks such as Web browsing and query suggestion. Because traditional taxonomy-based methods cannot deal with continually emerging words, Web-based methods have recently been proposed to solve this problem. However, because of the noise and redundancy hidden in Web data, robustness and accuracy remain challenges. This paper proposes a method for measuring semantic similarity between words that integrates the page counts and snippets returned by Web search engines. Semantic snippets and the number of search results are used to remove noise and redundancy from the Web snippets. On this basis, a method is proposed that integrates page counts, semantic snippets and the number of search results already displayed. The proposed method requires no human-annotated knowledge or ontology and can easily be applied to Web-related tasks. A correlation coefficient of 0.851 against the Rubenstein-Goodenough benchmark dataset shows that the proposed method outperforms existing Web-based methods by a wide margin. Moreover, the proposed semantic similarity measure significantly improves the quality of query suggestion compared with some page-count-based methods.

Key words: Semantic similarity, Information retrieval, Query suggestion, Web search
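
To make the two quantitative ideas in the abstract concrete, the following is a minimal sketch and not the paper's method. It shows one common page-count-based similarity (a Jaccard coefficient over search-engine hit counts) and the Pearson correlation coefficient that is typically used to compare a measure against human ratings such as the Rubenstein-Goodenough set. The function names `web_jaccard` and `pearson` are hypothetical, and the sketch assumes the page counts have already been obtained from some search engine; it does not model the snippet-based noise removal described above.

```python
import math


def web_jaccard(count_p, count_q, count_pq):
    """Jaccard coefficient from page counts (illustrative, not the paper's formula).

    count_p  : hits for the query "P"
    count_q  : hits for the query "Q"
    count_pq : hits for the conjunctive query "P AND Q"
    """
    denominator = count_p + count_q - count_pq
    # Search-engine counts are noisy estimates; treat degenerate cases as "no similarity".
    if count_pq <= 0 or denominator <= 0:
        return 0.0
    return count_pq / denominator


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

In an evaluation of this kind, `pearson` would be called with the human similarity ratings for each word pair as one sequence and the scores produced by the similarity measure as the other; the 0.851 reported in the abstract is a coefficient of this type.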

