Computer Science, 2018, Vol. 45, Issue (11A): 113-116.

• Intelligent Computing •

WDS: Word Documents Similarity Based on Word Embedding

WANG Lu-qi 1, LONG Jun 1, YUAN Xin-pan 2

  1. School of Software, Central South University, Changsha 410075, China
  2. School of Computer and Communication, Hunan University of Technology, Zhuzhou, Hunan 412000, China
  • Online: 2019-02-26 Published: 2019-02-26
  • Corresponding author: YUAN Xin-pan (born 1982), male, Ph.D., lecturer. Main research interests: information retrieval and data mining. E-mail: xpyuanfly@163.com
  • About the authors: WANG Lu-qi (born 1990), male, master's student. Main research interest: natural language processing. E-mail: wangluqinet@163.com; LONG Jun (born 1972), male, Ph.D., professor, doctoral supervisor. Main research interest: Internetware
  • Funding:
    This work was supported by the National Natural Science Foundation of China (61402165, S1651002) and the Key Research and Development Program of Hunan Province (2016JC2018).

Abstract: To further improve the accuracy of document similarity computation, this paper proposes, within the framework of the system similarity function, a word-embedding-based document similarity function called WDS (Word Documents Similarity) and its optimized algorithm FWDS (Fast Word Documents Similarity). WDS treats the set of word embeddings corresponding to a document's word set as a system, and the embedding of each word as an element of that system, so the similarity of two documents is the similarity of their two embedding sets. In the actual computation, the two vector sets are aligned using the first set as the standard, and several parameters are computed for the elements that belong to MOPs (similar elements) and for those that do not. Experimental results show that, as document length increases, WDS achieves a higher hit rate than WMD and WJ.

Key words: Document similarity, MOP, System similarity function, Weight, Word embedding
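
The abstract describes aligning two sets of word vectors with the first set as the standard, then scoring matched and unmatched elements. The sketch below illustrates that general idea only. The abstract does not give the WDS formula, so the greedy matching, the `threshold` parameter, and the scoring rule (sum of matched similarities divided by the number of matched pairs plus unmatched elements) are all assumptions made for illustration, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def align(first, second, threshold=0.5):
    """Greedily align each vector in `first` (the standard) to its best unused
    vector in `second`. Assumed rule: keep a pair only if its cosine
    similarity is at least `threshold`.

    Returns a list of (index_in_first, index_in_second, similarity) tuples.
    """
    used = set()
    pairs = []
    for i, u in enumerate(first):
        best_j, best_sim = None, -1.0
        for j, v in enumerate(second):
            if j in used:
                continue
            s = cosine(u, v)
            if s > best_sim:
                best_j, best_sim = j, s
        if best_j is not None and best_sim >= threshold:
            used.add(best_j)
            pairs.append((i, best_j, best_sim))
    return pairs


def set_similarity(first, second, threshold=0.5):
    """Illustrative score (not the paper's formula): matched similarities
    summed, divided by matched pairs plus unmatched elements of both sets."""
    if not first and not second:
        return 1.0
    pairs = align(first, second, threshold)
    matched = len(pairs)
    unmatched = (len(first) - matched) + (len(second) - matched)
    return sum(s for _, _, s in pairs) / (matched + unmatched)


# Toy 2-dimensional "embeddings"; real word vectors would come from a trained model.
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
score = set_similarity(doc_a, doc_b)
```

In this toy case both vectors of `doc_a` find an exact match in `doc_b`, and the one leftover vector in `doc_b` lowers the score, which is the role the unmatched (non-similar) elements play in this sketch.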

CLC Number:

  • TP301.6