Computer Science ›› 2014, Vol. 41 ›› Issue (3): 212-217.

• Artificial Intelligence •

Properties of High-dimensional Data Space and Metric Choice

HE Jin-rong, DING Li-xin, HU Qing-hui, LI Zhao-kui

  1. State Key Laboratory of Software Engineering, School of Computer, Wuhan University, Wuhan 430072, China
  • Online: 2018-11-14  Published: 2018-11-14
  • Supported by:
    This work was supported by the Fundamental Research Funds for the Central Universities (2012211020209), the Guangdong Province Ministry-Province Industry-University-Research Cooperation Project (2011B090400477), the Zhuhai Industry-University-Research Cooperation Special Funds (2011A050101005, 2012D0501990016), and the Zhuhai Key Laboratory Science and Technology Research Project (2012D0501990026).





CLC Number: TP181    Document Code: A

Abstract: High-dimensional data analysis is a core task in machine learning and data mining. By finding an optimal subspace for data representation, dimensionality reduction algorithms lower computational cost and improve the performance of subsequent classification or clustering algorithms, making them an effective tool for analyzing high-dimensional data. However, there is little theoretical guidance for such analysis. This paper reviews statistical and geometrical properties of high-dimensional data spaces and gives intuitive explanations, from several perspectives, of the “concentration of measure” phenomenon. It then discusses how the choice of metric affects classical distance-based machine learning algorithms on high-dimensional data. Empirical results show that fractional distance metrics can significantly improve the performance of K-nearest-neighbor and K-means.
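The abstract's central claim, that distances lose contrast as dimensionality grows and that fractional (p < 1) Minkowski-style dissimilarities retain more of it, can be illustrated with a short simulation. The sketch below is not the authors' experimental setup; the uniform hypercube data, the sample size, and the relative-contrast measure (Dmax - Dmin)/Dmin are illustrative assumptions in the spirit of references [14] and [20].

import numpy as np

def fractional_distance(x, Y, p):
    # (sum |x - y|^p)^(1/p): the usual Minkowski distance for p >= 1,
    # a "fractional" dissimilarity (not a true metric) when 0 < p < 1.
    return (np.abs(Y - x) ** p).sum(axis=1) ** (1.0 / p)

def relative_contrast(D):
    # How well the nearest point is separated from the farthest one.
    return (D.max() - D.min()) / D.min()

rng = np.random.default_rng(0)
n = 1000                                  # illustrative sample size
for d in (2, 10, 100, 1000):              # growing dimensionality
    X = rng.random((n, d))                # uniform points in the unit hypercube
    q = rng.random(d)                     # a query point
    row = [f"d={d:>4}"]
    for p in (0.5, 1.0, 2.0):
        rc = relative_contrast(fractional_distance(q, X, p))
        row.append(f"p={p}: {rc:6.3f}")
    print("  ".join(row))

On a typical run the contrast shrinks toward zero as d grows, i.e. the distances "concentrate", and for a fixed d it is consistently larger at p = 0.5 than in the Euclidean case p = 2, which is the behavior that motivates plugging fractional metrics into distance-based methods such as K-nearest-neighbor and K-means.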

Key words: High-dimensional data, Curse of dimensionality, Concentration of measure

[1] Skillicorn D B.Understanding High-Dimensional Spaces[M].Springer-Verlag New York Incorporated,2013
[2] Donoho D L.High-dimensional data analysis:The curses and blessings of dimensionality[J].AMS Math Challenges Lecture,2000:1-32
[3] Bellman R.Adaptive Control Processes:A Guided Tour[M].Princeton University Press,Princeton,New Jersey,1961
[4] Fukunaga K.Introduction to Statistical Pattern Recognition (2nd ed)[M].New York:Academic Press,1990
[5] Mil’man V D.New proof of the theorem of A.Dvoretzky on intersections of convex bodies[J].Functional Analysis and its Applications,1971,5(4):288-295
[6] Weber R,Schek H-J,Blott S.A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces[C]∥Proceedings of the 24th International Conference on Very Large Data Bases,ser.VLDB’98.San Francisco,CA,USA:Morgan Kaufmann Publishers Inc.,1998:194-205
[7] Gaede V,Günther O.Multidimensional access methods[J].ACM Computing Surveys (CSUR),1998,30(2):170-231
[8] Francois D,Wertz V,Verleysen M.Non-euclidean metrics forsimilarity search in noisy datasets[C]∥Proc.of ESANN.2005
[9] Kouiroukidis N,Evangelidis G.The Effects of Dimensionality Curse in High Dimensional kNN Search[C]∥2011 15th Panhellenic Conference on Informatics (PCI).IEEE,2011:41-45
[10] Clarke R,Ressom H W,Wang A,et al.The properties of high-dimensional data spaces:implications for exploring gene and protein expression data[J].Nature Reviews Cancer,2008,8(1):37-49
[11] Jimenez L,Landgrebe D.Supervised Classification in High Dimensional Space:Geometrical,Statistical and Asymptotical Properties of Multivariate data[J].IEEE Transactions on Geoscience and Remote Sensing,1999,37(6)
[12] Jimenez L,Landgrebe D.High dimensional feature reduction via projection pursuit[C]∥Geoscience and Remote Sensing Symposium,1994.IGARSS’94.Surface and Atmospheric Remote Sensing:Technologies,Data Analysis and Interpretation.International.IEEE,1994,2:1145-1147
[13] Rupp M,Schneider P,Schneider G.Distance phenomena in high-dimensional chemical descriptor spaces:Consequences for similarity-based approaches[J].Journal of Computational Chemistry,2009,30(14):2285-2296
[14] Francois D,Wertz V,Verleysen M.The concentration of fractional distances[J].IEEE Transactions on Knowledge and Data Engineering,2007,19(7):873-886
[15] Durrant R J,Kabán A.When is ‘nearest neighbor’ meaningful:A converse theorem and implications[J].Journal of Complexity,2009,25(4):385-397
[16] Beyer K,Goldstein J,Ramakrishnan R,et al.When is “nearest neighbor” meaningful?[M]//Database Theory—ICDT’99.Springer Berlin Heidelberg,1999:217-235
[17] Hinneburg A,Aggarwal C C,Keim D A.What is the nearest neighbor in high dimensional spaces?[M].Bibliothek der Universität Konstanz,2000
[18] Francois D,Wertz V,Verleysen M.Non-euclidean metrics forsimilarity search in noisy datasets[C]∥Proc.of ESANN.2005
[19] Hsu C M,Chen M S.On the design and applicability of distance functions in high-dimensional data space[J]. IEEE Transactions on Knowledge and Data Engineering,2009,21(4):523-536
[20] Aggarwal C C,Hinneburg A,Keim D A.On the surprising behavior of distance metrics in high dimensional spaces[C]∥Proceedings of the 8th International Conference on Database Theory,ser.ICDT ’01.London,UK:Springer-Verlag,2001:420-434
[21] Canal L.A normal approximation for the chi-square distribution[J].Computational Statistics & Data Analysis,2005,48(4):803-808
[22] Katafygiotis L S,Zuev K M.Geometric insight into the challenges of solving high-dimensional reliability problems[J].Probabilistic Engineering Mechanics,2008,23(2):208-218
[23] Wang Jian-zhong.Geometric Structure of High-Dimensional Data and Dimensionality Reduction[M].Beijing:Higher Education Press (China) and Springer,2011
[24] Hopcroft J,Kannan R.Computer Science Theory for the Information Age[M].Springer,2012:7-27
