计算机科学 ›› 2021, Vol. 48 ›› Issue (12): 24-28.doi: 10.11896/jsjkx.210600213
胡蓉, 阳王东, 王昊天, 罗辉章, 李肯立
HU Rong, YANG Wang-dong, WANG Hao-tian, LUO Hui-zhang, LI Ken-li
摘要: Word Mover's Distance(WMD)是一种度量文本相似度的方法,它将两个文本之间的差异定义为文本的词嵌入向量之间的最小距离。WMD利用词汇表,将文本表示为归一化的词袋向量。文本的单词在语料中所占的比例很小,因此用词袋模型生成的文本向量很稀疏。多个文本可以组成一个高维的稀疏矩阵,这样的稀疏矩阵会生成大量不必要的运算。通过一次性对多个目标文本计算单个源文本的WMD,可以使计算过程高度并行化。针对文本向量的稀疏性,文中提出了一种基于GPU的并行Sinkhorn-WMD算法,采取压缩格式存储目标文本的方式来提高内存利用率,根据稀疏结构减少中间过程的计算。利用预训练词嵌入向量计算单词距离矩阵,对WMD算法进行改进,在两个公开的新闻数据集上进行优化算法的验证。实验结果表明,在NVIDIA TITAN RTX上并行算法与CPU串行相比最高可以达到67.43倍的加速。
中图分类号:
[1]BROKOS G I,MALAKASIOTIS P,ANDROUTSOPOULOS I.Using Centroids of Word Embeddings and Word Mover's Distance for Biomedical Document Retrieval in Question Answering[J].arXiv:1608.03905,2016. [2]LU C,ARONSON A,SHOOSHAN S,et al.Spell checker for consumer language (CSpell)[J].Journal of the American Medical Informatics Association,2019,26:211-218. [3]ALZAHRANI S,SALIM N.Fuzzy semantic-based string similarity for extrinsic plagiarism detection[J].Braschler and Harman,2010,1176:1-8. [4]KUSNER M,SUN Y,KOLKIN N,et al.From word embeddings to document distances[C]//International Conference on Machine Learning.PMLR,2015:957-966. [5]MIKOLOV T,SUTSKEVER I,CHEN K,et al.Distributed representations of words and phrases and their compositionality[C]//Advances in Neural Information Processing Systems.2013:3111-3119. [6]MIKOLOV T,CHEN K,CORRADO G,et al.Efficient Estimation of Word Representations in Vector Space[J].arXiv:1301.3781,2013. [7]LIU B.Text sentiment analysis based on CBOW model and deep learning in big data environment[J].Journal of Ambient Intelligence and Humanized Computing,2020,11(2):451-458. [8]SHAZEER N,PELEMANS J,CHELBA C.Skip-gram Language Modeling Using Sparse Non-negative Matrix Probability Estimation[J].arXiv:1412.1454,2015. [9]TITHI J J,PETRINI F.An Efficient Shared-memory Parallel Sinkhorn-Knopp Algorithm to Compute the Word Mover's Distance[J].arXiv:2005.06727,2021. [10]RUBNER Y,TOMASI C,GUIBAS L J.The Earth Mover's Distance as a Metric for Image Retrieval[J].International Journal of Computer Vision,2000,40(2):99-121. [11]PEYRÉ G,CUTURI M.Computational optimal transport:With applications to data science[J].Foundations and Trends© in Machine Learning,2019,11(5/6):355-607. [12]ALVAREZ-MELIS D,JEGELKA S,JAAKKOLA T S.To- wards optimal transport with global invariances[C]//22nd International Conference on Artificial Intelligence and Statistics.PMLR,2019:1870-1879. [13]CORENFLOS A,THORNTON J,DOUCET A,et al.Differen- tiable Particle Filtering via Entropy-Regularized Optimal Transport[J].arXiv:2102.07850,2021. [14]FEYDY J,ROUSSILLON P,TROUVÉ A,et al.Fast and scalable optimal transport for brain tractograms[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention.Cham:Springer,2019:636-644. [15]CUTURI M.Sinkhorn Distances:Lightspeed Computation of Optimal Transportation Distances[J].arXiv:1306.0895,2013. [16]KNIGHT P A.The Sinkhorn-Knopp Algorithm:Convergence and Applications[J].SIAM Journal on Matrix Analysis and Applications,2008,30(1):261-275. [17]CUDA C++ Best Practices Guide[EB/OL].(2021-07-13). http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html. |
[1] | 宗迪迪, 谢益武. 基于法线迭代的模型中轴生成方法 Model Medial Axis Generation Method Based on Normal Iteration 计算机科学, 2022, 49(6A): 764-770. https://doi.org/10.11896/jsjkx.210400050 |
[2] | 陈鑫, 李芳, 丁海昕, 孙唯哲, 刘鑫, 陈德训, 叶跃进, 何香. 面向国产异构众核架构的CFD非结构网格计算并行优化方法 Parallel Optimization Method of Unstructured-grid Computing in CFD for DomesticHeterogeneous Many-core Architecture 计算机科学, 2022, 49(6): 99-107. https://doi.org/10.11896/jsjkx.210400157 |
[3] | 汪晋, 刘江. 基于GPU的并行DILU预处理技术 GPU-based Parallel DILU Preconditioning Technique 计算机科学, 2022, 49(6): 108-118. https://doi.org/10.11896/jsjkx.210300259 |
[4] | 傅天豪, 田鸿运, 金煜阳, 杨章, 翟季冬, 武林平, 徐小文. 一种面向构件化并行应用程序的性能骨架分析方法 Performance Skeleton Analysis Method Towards Component-based Parallel Applications 计算机科学, 2021, 48(6): 1-9. https://doi.org/10.11896/jsjkx.201200115 |
[5] | 何亚茹, 庞建民, 徐金龙, 朱雨, 陶小涵. 基于神威平台的Floyd并行算法的实现和优化 Implementation and Optimization of Floyd Parallel Algorithm Based on Sunway Platform 计算机科学, 2021, 48(6): 34-40. https://doi.org/10.11896/jsjkx.201100051 |
[6] | 李繁, 严星, 张晓宇. 基于GPU的特征脸算法优化研究 Optimization of GPU-based Eigenface Algorithm 计算机科学, 2021, 48(4): 197-204. https://doi.org/10.11896/jsjkx.200600033 |
[7] | 冯凯, 马鑫玉. (n,k)-冒泡排序网络的子网络可靠性 Subnetwork Reliability of (n,k)-bubble-sort Networks 计算机科学, 2021, 48(4): 43-48. https://doi.org/10.11896/jsjkx.201100139 |
[8] | 马梦宇, 吴烨, 陈荦, 伍江江, 李军, 景宁. 显示导向型的大规模地理矢量实时可视化技术 Display-oriented Data Visualization Technique for Large-scale Geographic Vector Data 计算机科学, 2020, 47(9): 117-122. https://doi.org/10.11896/jsjkx.190800121 |
[9] | 陈国良, 张玉杰. 并行计算学科发展历程 Development of Parallel Computing Subject 计算机科学, 2020, 47(8): 1-4. https://doi.org/10.11896/jsjkx.200600027 |
[10] | 阳王东, 王昊天, 张宇峰, 林圣乐, 蔡沁耘. 异构混合并行计算综述 Survey of Heterogeneous Hybrid Parallel Computing 计算机科学, 2020, 47(8): 5-16. https://doi.org/10.11896/jsjkx.200600045 |
[11] | 冯凯, 李婧. k元n方体的子网络可靠性研究 Study on Subnetwork Reliability of k-ary n-cubes 计算机科学, 2020, 47(7): 31-36. https://doi.org/10.11896/jsjkx.190700170 |
[12] | 刘世芳, 赵永华, 于天禹, 黄荣锋. 广义稠密对称特征问题标准化算法在GPU集群上的有效实现 Efficient Implementation of Generalized Dense Symmetric Eigenproblem StandardizationAlgorithm on GPU Cluster 计算机科学, 2020, 47(4): 6-12. https://doi.org/10.11896/jsjkx.191000009 |
[13] | 左宪禹, 张哲, 苏岳瀚, 刘扬, 葛强, 田军锋. 基于GPU多流并发并行模型的NDVI提取算法 Extraction Algorithm of NDVI Based on GPU Multi-stream Parallel Model 计算机科学, 2020, 47(4): 25-29. https://doi.org/10.11896/jsjkx.190500029 |
[14] | 杨宗霖, 李天瑞, 刘胜久, 殷成凤, 贾真, 珠杰. 基于Spark Streaming的流式并行文本校对 Streaming Parallel Text Proofreading Based on Spark Streaming 计算机科学, 2020, 47(4): 36-41. https://doi.org/10.11896/jsjkx.190300070 |
[15] | 贾经冬, 张筱曼, 郝璐, 谭火彬. 工业界需求工程关注点分析 Analysis of Focuses of Requirements Engineering in Industry 计算机科学, 2020, 47(12): 25-34. https://doi.org/10.11896/jsjkx.201200048 |
|