Computer Science (计算机科学), 2018, Vol. 45, Issue 11A: 113-116.
WANG Lu-qi (王路琪)1, LONG Jun (龙军)1, YUAN Xin-pan (袁鑫攀)2
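Abstract: To further improve the accuracy of text similarity computation, this paper proposes, within the framework of the systematic similarity function, a word-vector-based text similarity function, WDS (Word Documents Similarity), and its optimized algorithm, FWDS (Fast Word Documents Similarity). The function treats the set of word vectors corresponding to a text's words as a system and each word's vector as an element of that system, so the similarity of two texts becomes the similarity of two vector sets. In the actual computation, the two vector sets are aligned using the first set as the reference, and several parameters of the similar and non-similar elements are computed at the same time. Experimental results show that, as text length increases, WDS achieves a higher hit rate than the WMD and WJ algorithms.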
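The alignment idea in the abstract can be illustrated with a small sketch. The abstract does not give the WDS formula, so everything below is an assumption made for illustration: the cosine measure, the `threshold` parameter, the greedy one-to-one matching, and the normalization by the larger set size are hypothetical choices, and the function names `cosine`, `align_sets`, and `set_similarity` are not from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def align_sets(first, second, threshold=0.8):
    """Greedily align `second` to `first`, using `first` as the reference.

    Each vector in `first` is paired with its most similar unused vector in
    `second`, provided the cosine similarity reaches `threshold`
    (an assumed cut-off, not a value from the paper).
    Returns a list of (index_in_first, index_in_second, similarity) tuples.
    """
    unused = set(range(len(second)))
    pairs = []
    for i, u in enumerate(first):
        best_j, best_s = None, -1.0
        for j in sorted(unused):
            s = cosine(u, second[j])
            if s >= threshold and s > best_s:
                best_j, best_s = j, s
        if best_j is not None:
            unused.remove(best_j)
            pairs.append((i, best_j, best_s))
    return pairs


def set_similarity(first, second, threshold=0.8):
    """Illustrative set similarity: summed similarity of the aligned
    (similar) elements divided by the size of the larger set, so that
    unaligned (non-similar) elements lower the score."""
    if not first or not second:
        return 0.0
    pairs = align_sets(first, second, threshold)
    return sum(s for _, _, s in pairs) / max(len(first), len(second))


if __name__ == "__main__":
    # Toy 2-dimensional "word vectors" for two short texts.
    doc_a = [[1.0, 0.0], [0.0, 1.0]]
    doc_b = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    print(align_sets(doc_a, doc_b))
    print(round(set_similarity(doc_a, doc_b), 3))  # 0.667
```

In the toy example both vectors of the first text find an exact match in the second, while the third vector of the second text stays unaligned and counts as a non-similar element, which pulls the score down to 2/3. A real implementation would take the vectors from a trained word-embedding model and use the weighting defined in the paper.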