Computer Science (计算机科学) ›› 2018, Vol. 45 ›› Issue (6): 57-66. doi: 10.11896/j.issn.1002-137X.2018.06.010
• 14th National Conference on Web Information Systems and Applications •
冯艳红1,2, 于红1,2, 孙庚1,2, 彭松1
FENG Yan-hong1,2, YU Hong1,2, SUN Geng1,2, PENG Song1
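Abstract: Measuring the dissimilarity between semantic vectors is an important foundation for applying deep learning to problems in natural language processing. Dissimilarity measurement over high-dimensional semantic vectors suffers from the "concentration of measure" problem, so results obtained with traditional measures fail to reflect the differences between semantic vectors. To address this problem, a dissimilarity measure based on the Jaccard coefficient with asymmetric multi-valued features is proposed. The statistical distribution of dimension values in high-dimensional semantic vectors shows that, in some dimensions, the values are densely concentrated in a specific range and therefore cannot contribute to dissimilarity. Different dimensions thus contribute unequally to dissimilarity, that is, asymmetrically. The method defines an importance function over dimension values, lets only the dimensions whose importance value meets a threshold take part in the dissimilarity computation, and discards the dimensions that cannot contribute, thereby reducing dimensionality and alleviating the concentration-of-measure problem. Different measures were compared on semantic vectors of different dimensionalities, using a fishery dataset and a public dataset. The results show that, with no noticeable loss of semantic quality, the proposed method substantially improves the diversity index over the best existing measure.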
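The abstract does not give the importance function or the exact form of the coefficient, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a hypothetical importance function (how far a value lies outside a central "dense" interval estimated per dimension from the corpus) and a Jaccard-style ratio that ignores dimensions in which both vectors fall inside the dense interval, analogous to ignoring 0-0 matches in the asymmetric binary Jaccard coefficient. The names `dense_intervals`, `importance`, and `asymmetric_dissimilarity`, as well as the parameters `keep` and `threshold`, are illustrative and do not come from the paper.

```python
def dense_intervals(vectors, keep=0.5):
    """Per dimension, estimate the central interval holding roughly
    `keep` of the observed values (assumed to be the 'dense' range)."""
    n = len(vectors)
    dims = len(vectors[0])
    cut = int(n * (1 - keep) / 2)  # number of values trimmed from each tail
    intervals = []
    for d in range(dims):
        column = sorted(v[d] for v in vectors)
        intervals.append((column[cut], column[n - 1 - cut]))
    return intervals


def importance(value, interval):
    """Hypothetical importance function: distance of `value` outside
    the dense interval, and 0.0 for values inside it."""
    lo, hi = interval
    if value < lo:
        return lo - value
    if value > hi:
        return value - hi
    return 0.0


def asymmetric_dissimilarity(a, b, intervals, threshold=0.0):
    """Jaccard-style dissimilarity over the dimensions in which at
    least one of the two vectors has an important value."""
    active = 0
    differing = 0
    for x, y, interval in zip(a, b, intervals):
        x_important = importance(x, interval) > threshold
        y_important = importance(y, interval) > threshold
        if not (x_important or y_important):
            continue  # both values in the dense range: dimension is dropped
        active += 1
        # Count a difference if only one value is important, or if both
        # are important but lie on opposite sides of the dense interval.
        opposite_sides = (x < interval[0]) != (y < interval[0])
        if x_important != y_important or opposite_sides:
            differing += 1
    return differing / active if active else 0.0


corpus = [[0.0, 0.5], [1.0, 0.5], [2.0, 0.5], [3.0, 0.5], [4.0, 0.5]]
intervals = dense_intervals(corpus)
print(asymmetric_dissimilarity(corpus[0], corpus[4], intervals))
```

In this toy corpus the second dimension is constant, so it never passes the threshold and is excluded from every comparison, which is the dimensionality-reduction effect the abstract describes. The first dimension is the only one that can separate vectors.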