Computer Science (计算机科学) ›› 2018, Vol. 45 ›› Issue (7): 186-189. doi: 10.11896/j.issn.1002-137X.2018.07.032
田星,郑瑾,张祖平
TIAN Xing, ZHENG Jin, ZHANG Zu-ping
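Abstract: Building on a study of the traditional Jaccard algorithm, this paper proposes an improved Jaccard sentence-similarity algorithm based on word vectors. The traditional Jaccard algorithm uses the literal surface form of a sentence as its features, which limits its ability to measure similarity at the semantic level. With the rise of deep learning, and in particular the introduction of word vectors, the representation of words in computers has made breakthrough progress. The proposed algorithm first maps each word, through training, to a high-dimensional semantic vector, then computes the similarity between the word vectors of the two sentences. Word pairs whose similarity exceeds a threshold α are treated as the co-occurring part, and the sentence similarity is computed from them. Experiments show that, compared with the traditional Jaccard algorithm, the proposed algorithm is noticeably more accurate at short-text similarity computation.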
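The steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formula, so the sketch assumes cosine similarity between word vectors, a greedy one-to-one pairing of words, and a union size of |A| + |B| minus the number of matched pairs. The names `embedding_jaccard`, `word_similarity`, and `cosine`, the default `alpha=0.8`, and the `vectors` dictionary (word to list of floats) are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_similarity(a, b, vectors):
    """Similarity of two different words; out-of-vocabulary words score 0.0 (assumption)."""
    if a not in vectors or b not in vectors:
        return 0.0
    return cosine(vectors[a], vectors[b])


def embedding_jaccard(tokens_a, tokens_b, vectors, alpha=0.8):
    """Jaccard-style similarity where words with vector similarity above alpha co-occur."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0

    # Identical words always co-occur, exactly as in traditional Jaccard.
    exact = set_a & set_b
    matched = len(exact)

    # Greedily pair each remaining word of A with its most similar unused
    # word of B, keeping the pair only if the similarity is above alpha.
    rest_b = set_b - exact
    for a in sorted(set_a - exact):
        best_word, best_sim = None, alpha
        for b in sorted(rest_b):
            sim = word_similarity(a, b, vectors)
            if sim > best_sim:
                best_word, best_sim = b, sim
        if best_word is not None:
            rest_b.remove(best_word)
            matched += 1

    # Each matched pair counts once in the union (assumed definition).
    union = len(set_a) + len(set_b) - matched
    return matched / union


# Toy vectors, made up for illustration only.
toy_vectors = {"cat": [1.0, 0.0], "kitten": [0.9, 0.1], "sleeps": [0.0, 1.0]}
score = embedding_jaccard(["cat", "sleeps"], ["kitten", "sleeps"], toy_vectors)
```

With these toy vectors, "cat" and "kitten" have cosine similarity of about 0.99, so at α = 0.8 they count as co-occurring and `score` is 1.0. With an empty `vectors` dictionary the function falls back to the traditional Jaccard value of 1/3 for the same two sentences, which is the gap the abstract attributes to literal matching.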