Computer Science, 2018, Vol. 45, Issue (7): 186-189. doi: 10.11896/j.issn.1002-137X.2018.07.032

• Artificial Intelligence •

Jaccard Text Similarity Algorithm Based on Word Embedding

TIAN Xing, ZHENG Jin, ZHANG Zu-ping

  1. School of Information Science and Engineering, Central South University, Changsha 410083, China
  • Received: 2017-03-14  Online: 2018-07-30  Published: 2018-07-30
  • About the authors: TIAN Xing (born 1993), male, master's student; main research interests: machine learning and natural language processing; E-mail: grubbyskyer@qq.com. ZHENG Jin (born 1970), female, associate professor; main research interests: software engineering and machine learning; E-mail: zhengjin@csu.edu.cn. ZHANG Zu-ping (born 1966), male, professor; main research interests include software engineering, data mining, and information retrieval.
  • Funding:
    This work was supported by the National Natural Science Foundation of China (61379109).

Abstract: Building on a study of the traditional Jaccard algorithm, this paper proposes a Jaccard sentence similarity algorithm based on word embeddings. The traditional Jaccard algorithm uses the literal tokens of a sentence as its features, which limits its ability to measure similarity at the semantic level. With the rise of deep learning, and in particular the introduction of word embeddings, the representation of words in computers has advanced considerably. The proposed algorithm first trains a model that maps each word to a high-dimensional vector at the semantic level, and then computes the similarity between the word vectors of the two sentences. Word pairs whose similarity exceeds a threshold α are treated as the co-occurring part (the intersection), from which the sentence similarity is finally computed. Experiments show that, compared with the traditional Jaccard algorithm, the proposed algorithm noticeably improves the accuracy of short text similarity calculation.

Key words: Jaccard algorithm, Text similarity, Word embedding

CLC Number:

  • TP391.1
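
The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how matched word pairs are counted or how the union is formed, so the greedy one-to-one matching, the union size |A| + |B| - matches, the use of cosine similarity, the default `alpha=0.8`, and the toy two-dimensional vectors are all assumptions made for illustration. In practice the `vectors` mapping would come from a trained word embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classic_jaccard(words_a, words_b):
    """Traditional Jaccard: |A ∩ B| / |A ∪ B| over literal word sets."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def embedding_jaccard(words_a, words_b, vectors, alpha=0.8):
    """Jaccard-style score where word pairs with similarity > alpha count as shared.

    Assumptions (not taken from the paper): alpha < 1, each word is matched
    at most once (greedy, best pairs first), and the union size is
    |A| + |B| - matches.
    """
    uniq_a = list(dict.fromkeys(words_a))  # de-duplicate, keep order
    uniq_b = list(dict.fromkeys(words_b))
    if not uniq_a and not uniq_b:
        return 0.0

    # Collect every cross-sentence pair whose similarity exceeds alpha.
    candidates = []
    for i, word_a in enumerate(uniq_a):
        for j, word_b in enumerate(uniq_b):
            if word_a == word_b:
                sim = 1.0  # identical words always match
            elif word_a in vectors and word_b in vectors:
                sim = cosine(vectors[word_a], vectors[word_b])
            else:
                sim = 0.0  # out-of-vocabulary words match only literally
            if sim > alpha:
                candidates.append((sim, i, j))

    # Greedy one-to-one matching, most similar pairs first.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)

    matched = len(used_a)
    union = len(uniq_a) + len(uniq_b) - matched
    return matched / union


# Toy vectors, invented for this example only.
toy_vectors = {
    "car": [1.0, 0.0],
    "automobile": [0.9, 0.1],
    "fast": [0.0, 1.0],
    "quick": [0.1, 0.9],
}
sentence_a = ["fast", "car"]
sentence_b = ["quick", "automobile"]
print(classic_jaccard(sentence_a, sentence_b))
print(embedding_jaccard(sentence_a, sentence_b, toy_vectors, alpha=0.8))
```

With these toy vectors the two sentences share no literal words, so the traditional score is 0.0, while both word pairs exceed the threshold and the embedding-based score is 1.0. Lowering or raising `alpha` changes how readily near-synonyms are counted as co-occurring.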