Computer Science ›› 2018, Vol. 45 ›› Issue (7): 186-189. doi: 10.11896/j.issn.1002-137X.2018.07.032

• Artificial Intelligence •

Jaccard Text Similarity Algorithm Based on Word Embedding

TIAN Xing, ZHENG Jin, ZHANG Zu-ping   

  1. School of Information Science and Engineering, Central South University, Changsha 410083, China
  • Received: 2017-03-14 Online: 2018-07-30 Published: 2018-07-30

Abstract: Building on a study of the traditional Jaccard algorithm and its limitations, this paper proposes a Jaccard sentence similarity algorithm based on word embedding. The traditional Jaccard algorithm compares sentences by their literal tokens, which limits its ability to capture semantic similarity. With the rapid development of deep learning, and in particular the introduction of word embedding, there has been a breakthrough in how words are represented in a computer. The proposed algorithm first maps each word, through training, to a high-dimensional vector that encodes its meaning, and then calculates the similarity between the word vectors of the two sentences. Word pairs whose similarity is higher than the threshold α are regarded as the intersection, and finally the sentence similarity is calculated. Experiments show that the algorithm significantly improves the accuracy of short text similarity calculation compared with the traditional Jaccard algorithm.
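The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how word pairs are matched or how the union is counted, so the greedy one-to-one matching, the union formula |A| + |B| - |intersection|, the cosine measure, the function names, and the toy three-dimensional vectors are all assumptions made for illustration. In practice the vectors would come from a trained word embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Traditional Jaccard: literal token overlap divided by token union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def embedding_jaccard(a, b, vectors, alpha=0.8):
    """Jaccard variant whose intersection holds word pairs with similarity > alpha.

    Assumptions (not stated in the abstract): pairs are matched greedily and
    one-to-one, best similarity first, and union = |A| + |B| - |intersection|.
    """
    # Deduplicate tokens while keeping their order.
    a, b = list(dict.fromkeys(a)), list(dict.fromkeys(b))
    candidates = []
    for i, wa in enumerate(a):
        for j, wb in enumerate(b):
            if wa == wb:
                sim = 1.0  # identical tokens always match
            elif wa in vectors and wb in vectors:
                sim = cosine(vectors[wa], vectors[wb])
            else:
                sim = 0.0  # no vector available: fall back to literal comparison
            if sim > alpha:
                candidates.append((sim, i, j))
    candidates.sort(reverse=True)  # most similar pairs first
    used_a, used_b = set(), set()
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
    matched = len(used_a)
    union = len(a) + len(b) - matched
    return matched / union if union else 0.0


# Hypothetical toy vectors, hand-made for this example only.
TOY_VECTORS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "fast": [0.1, 0.9, 0.1],
    "quick": [0.15, 0.85, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

s1 = ["the", "car", "is", "fast"]
s2 = ["the", "automobile", "is", "quick"]
print(jaccard(s1, s2))
print(embedding_jaccard(s1, s2, TOY_VECTORS, alpha=0.8))
```

With these toy vectors the literal Jaccard score for the two sentences is 1/3, because only "the" and "is" overlap, while the embedding-based score is 1.0, because "car"/"automobile" and "fast"/"quick" both exceed the threshold. The value 0.8 for α is an arbitrary choice for the example, not the threshold used in the paper.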

Key words: Jaccard algorithm, Text similarity, Word embedding

CLC Number: TP391.1