Computer Science (计算机科学), 2018, Vol. 45, Issue (6): 57-66. doi: 10.11896/j.issn.1002-137X.2018.06.010

• The 14th National Conference on Web Information Systems and Applications

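Diversity Measures Method in High-dimensional Semantic Vector Based on Asymmetric Multi-valued Feature Jaccard Coefficient

FENG Yan-hong1,2, YU Hong1,2, SUN Geng1,2, PENG Song1

  1. College of Information Engineering, Dalian Ocean University, Dalian 116023, China;
  2. Key Laboratory of Marine Information Technology of Liaoning Province, Dalian Ocean University, Dalian 116023, China
  • Online: 2018-06-15  Published: 2018-07-24
  • About the authors: FENG Yan-hong (born 1980), female, M.S., lecturer, CCF member; her research interests include natural language processing and machine learning. YU Hong (born 1968), female, Ph.D., professor; her research interests include data mining and information retrieval; E-mail: yuhong@dlou.edu.cn (corresponding author). SUN Geng (born 1979), male, M.S., associate professor; his research interests include embedded systems. PENG Song (born 1993), male, master's student; his research interests include natural language processing and machine learning.
  • Supported by:
    This work was supported by the Dalian Science and Technology Plan Project "Research on Key Technologies for Marine Fishery Big Data Management and Integration" (2015A11GX22) and the Liaoning Province College Student Innovation and Entrepreneurship Project "Research and Implementation of an Intelligent Question Answering System for the Fishery Domain" (201710158000131).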

Abstract: Measuring the dissimilarity (diversity) between semantic vectors is an important foundation for solving natural language processing problems with deep learning. Dissimilarity measurement on high-dimensional semantic vectors suffers from the "measurement concentration" problem: the results produced by traditional measures fail to reflect the differences between semantic vectors. To address this problem, a dissimilarity measure based on the asymmetric multi-valued feature Jaccard coefficient is proposed. The statistical distribution of dimension values in high-dimensional semantic vectors shows that the values of some dimensions are densely concentrated in a specific range, so those dimensions cannot contribute to the dissimilarity. Different dimensions therefore contribute unequally to the dissimilarity, that is, the contribution is asymmetric. The method defines an importance function over dimension values, includes in the dissimilarity calculation only the dimensions whose importance value satisfies a threshold, and discards the dimensions that cannot contribute. This reduces the dimensionality and alleviates the "measurement concentration" problem. Experiments on a fishery data set and a public data set compared different measures on semantic vectors of different dimensionalities. The results show that, without a marked loss of semantic quality, the diversity index of the proposed method is substantially higher than that of the best existing measure.

Key words: Asymmetric multi-valued feature, Jaccard coefficient, High-dimensional semantic vector, Measures method, Measurement concentration

CLC Number: TP183
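
The abstract does not give the importance function, the threshold, or the exact form of the Jaccard coefficient, so the following is only a minimal sketch of the general idea (filter out low-importance dimensions, then compute a Jaccard-style dissimilarity on the rest). It rests on several assumptions that are not from the paper: the dense value range is assumed to be centred on `center`, the hypothetical `importance` function is simply the distance from that centre, a dimension is kept when either vector's value meets the hypothetical `threshold`, and the Tanimoto form of the Jaccard coefficient is used for real-valued vectors.

```python
import numpy as np


def importance(values, center=0.0):
    """Hypothetical importance function (an assumption, not the paper's):
    distance of each dimension value from the centre of the dense range."""
    return np.abs(np.asarray(values, dtype=float) - center)


def asymmetric_jaccard_dissimilarity(a, b, threshold=0.5, center=0.0):
    """Jaccard-style dissimilarity over the dimensions judged important.

    A dimension takes part only if the importance of its value in at
    least one of the two vectors reaches `threshold`; all other
    dimensions are dropped. The Tanimoto coefficient is an assumed
    stand-in for the paper's multi-valued Jaccard coefficient.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("vectors must have the same shape")

    # Keep a dimension if either vector has an important value there.
    keep = (importance(a, center) >= threshold) | (importance(b, center) >= threshold)
    if not keep.any():
        return 0.0  # no dimension can contribute any difference

    a_k, b_k = a[keep], b[keep]
    dot = float(a_k @ b_k)
    denom = float(a_k @ a_k + b_k @ b_k - dot)
    if denom == 0.0:
        return 0.0  # both reduced vectors are all zeros
    return 1.0 - dot / denom


# Dimensions 1 and 3 hold small values near the centre and are ignored.
u = [0.9, 0.01, -0.8, 0.02]
v = [0.9, -0.02, -0.8, 0.01]
print(asymmetric_jaccard_dissimilarity(u, v))
```

In this sketch the two example vectors differ only in the low-importance dimensions, so those differences are filtered out before the coefficient is computed. Choosing `threshold` and `center` from the observed distribution of each dimension, as the abstract suggests, is left open here.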