Computer Science ›› 2018, Vol. 45 ›› Issue (11A): 113-116.

• Intelligent Computing •

WDS: Word Documents Similarity Based on Word Embedding

WANG Lu-qi1, LONG Jun1, YUAN Xin-pan2

  1. School of Software, Central South University, Changsha 410075, China
  2. School of Computer and Communication, Hunan University of Technology, Zhuzhou, Hunan 412000, China
  • Online: 2019-02-26  Published: 2019-02-26

Abstract: To further improve the accuracy of document similarity measurement, this paper presents Word Documents Similarity (WDS), a measure based on word embeddings under the framework of the system similarity function, together with its optimized algorithm FWDS (Fast Word Documents Similarity). WDS regards the set of word embeddings corresponding to a document's words as a system, and each word's embedding as an element of that system; the similarity of two documents is therefore the similarity of the two word-embedding sets. In the concrete calculation, the first vector set is taken as the standard, the two vector sets are aligned, and multiple parameters are computed for the elements inside and outside the MOPs. Experimental results show that, compared with WMD and WJ, WDS maintains a better hit rate as document length increases.
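The alignment idea sketched in the abstract (take the first embedding set as the standard and match each of its vectors against the second set) can be illustrated with a minimal, hypothetical simplification. The paper's actual WDS additionally weights matched and unmatched elements under the system similarity framework, which is not reproduced here; `wds_sketch` and its greedy best-match alignment are assumptions for illustration only.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def wds_sketch(doc_a, doc_b):
    """Simplified embedding-set similarity (not the paper's exact formula).

    doc_a, doc_b: lists of word-embedding vectors for the two documents.
    doc_a is treated as the standard set: each of its embeddings is
    greedily aligned to its most similar embedding in doc_b, and the
    document similarity is the mean of the aligned cosine similarities.
    Note the measure is asymmetric, mirroring the abstract's statement
    that the first vector set is used as the standard.
    """
    sims = [max(cosine(a, b) for b in doc_b) for a in doc_a]
    return sum(sims) / len(sims)
```

For example, two documents with identical embedding sets score 1.0, while documents whose embeddings are mutually orthogonal score 0.0.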

Key words: Document similarity, MOP, System similarity function, Weight, Word embedding

CLC Number: TP301.6
[1]GOPALAN P,CHARLIN L,BLEI D M.Content-based recommendations with Poisson factorization [J].Advances in Neural Information Processing Systems,2014,4(31):76-84.
[2]MINCHEVA S.FBK-HLT:An Application of Semantic Textual Similarity for Answer Selection in Community Question Answering[C]∥Proceedings of the International Workshop on Semantic Evaluation.2015.
[3]KIM Y.Convolutional neural networks for sentence classification[C]∥Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).2014:1746-1751.
[4]LIN C Y,OCH F J.Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics[C]∥Proceedings of the Annual Meeting of the Association for Computational Linguistics.2004:605-612.
[5]MIKOLOV T,CHEN K,CORRADO G,et al.Efficient Estimation of Word Representations in Vector Space[C]∥Proceedings of the International Conference on Learning Representations.2013.
[6]MIKOLOV T,SUTSKEVER I,CHEN K,et al.Distributed Representations of Words and Phrases and their Compositionality[J].Advances in Neural Information Processing Systems,2013,26(311):1-9.
[7]XU Shuai.Research and Implementation of Paraphrase Recognition Technology for Question Answering Systems[D].Harbin:Harbin Institute of Technology,2009.
[8]KUSNER M J,SUN Y,KOLKIN N I,et al.From word embeddings to document distances[C]∥International Conference on Machine Learning.2015:957-966.
[9]JASON.Document Similarity With Word Movers Distance[EB/OL].[2016-06-13].http://jxieeducation.com/2016-06-13/Document Similarity-With-Word-Movers-Distance.
[10]GUAN Y,WANG X,WANG Q.A New Measurement of Systematic Similarity [J].IEEE Transactions on Systems Man & Cybernetics Part A Systems & Humans,2008,38(4):743-758.
[11]GUO Sheng-guo,XING Dan-dan.Sentence Similarity Computation Based on Word Vectors and Its Application[J].Modern Electronics Technique,2016,39(13):99-102.
[12]LI Feng,HOU Jia-ying,ZENG Rong-ren,et al.Research on Multi-feature Sentence Similarity Computation Fusing Word Vectors[J].Journal of Frontiers of Computer Science and Technology,2017,11(4):608-618.
[13]JOHNSON J,DOUZE M,JÉGOU H.Billion-scale similarity search with GPUs[J].arXiv preprint arXiv:1702.08734,2017.
[14]RYGL J,POMIKÁLEK J,ŘEHŮŘEK R,et al.Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines[C]∥The Workshop on Representation Learning for NLP.2017:81-90.