高维数据空间的性质及度量选择

Abstract

Abstract: High-dimensional data analysis is the core task of machine learning and data mining．By finding optimal subspace for data representation,dimensionality reduction algorithms can reduce computational cost and improve the performance of subsequent classification or clustering algorithms,leading to effective techniques for high-dimensional data analysis．However,there is very little guidance for theoretical analysis on high-dimensional data．This paper reviewed some statistical and geometrical properties of high-dimensional data space,and gave some intuitive explanations on “concentration of measure” phenomenon from different perspectives．In order to improve performances of classical machine learning algorithms based on distance metric,this paper discussed the effects of metric choice on high-dimensional data analysis．Empirical results show that fractional distance metric can improve performances of K Nearest Neighbor and Kmeans significantly.

Key words: High-dimensional data,Curse of dimensionality,Concentration of measure

HE Jin-rong,DING Li-xin,HU Qing-hui and LI Zhao-kui. Properties of High-dimensional Data Space and Metric Choice[J].Computer Science, 2014, 41(3): 212-217.

References

[1] Skillicorn D B．Understanding High-Dimensional Spaces[M]．Springer-Verlag New York Incorporated,2013
[2] Donoho D L．High-dimensional data analysis:The curses andblessings of dimensionality[J]．AMS Math Challenges Lecture,2000:1-32
[3] Bellman R.Adaptive Control Process:A Guide Tour[M]．Princeton University Press,Princeton,New Jersey,1961
[4] Fukunaga K．Introduction to Statistical Pattern Recognition(2nd ed)[M]．New York:Academic,1990, 39-40(31-34):220-221
[5] Mil’man V D．New proof of the theorem of A．Dvoretzky on intersections of convex bodies[J]．Functional Analysis and its Applications,1971,5(4):288-295
[6] Weber R,Schek H-J,Blott S．A quantitative analysis and per-formance study for similarity-search methods in high-dimensionalspaces[C]∥Proceedings of the 24rd International Conference on Very Large Data Bases,ser．VLDB’98．San Francisco,CA,USA:Morgan Kaufmann Publishers Inc.,1998:194-205
[7] Gaede V,Günther O．Multidimensional access methods[J]．ACM Computing Surveys (CSUR),1998,30(2):170-231
[8] Francois D,Wertz V,Verleysen M．Non-euclidean metrics forsimilarity search in noisy datasets[C]∥Proc．of ESANN．2005
[9] Kouiroukidis N,Evangelidis G．The Effects of DimensionalityCurse in High Dimensional kNN Search[C]∥Informatics (PCI),201115th Panhellenic Conference on．IEEE,2011:41-45
[10] Clarke R,Ressom H W,Wang A,et al．The properties of high-dimensional data spaces:implications for exploring gene and protein expression data[J]．Nature Reviews Cancer,2008,8(1):37-49
[11] Jimenez L,Landgrebe D．Supervised Classification in High Dimensional Space:Geometrical,Statistical and Asymptotical Properties of Multivariate data[J]．IEEE Transactions on Geoscience and Remote Sensing,1999,37(6)
[12] Jimenez L,Landgrebe D．High dimensional feature reduction via projection pursuit[C]∥Geoscience and Remote Sensing Symposium,1994．IGARSS’94．Surface and Atmospheric Remote Sensing:Technologies,Data Analysis and Interpretation．International．IEEE,1994,2:1145-1147
[13] Rupp M,Schneider P,Schneider G．Distance phenomena in high-dimensional chemical descriptor spaces:Consequences for similarity-based approaches[J]．Journal of Computational Chemistry,2009,30(14):2285-2296
[14] Francois D,Wertz V,Verleysen M．The concentration of fractional distances[J]．IEEE Transactions on Knowledge and Data Engineering,2007,19(7):873-886
[15] Durrant R J,Kabán A．When is ‘nearest neighbor’meaningful:A converse theorem and implications[J]．Journal of Complexity,2009,25(4):385-397
[16] Beyer K,Goldstein J,Ramakrishnan R,et al．When is “nearest neighbor” meaningful?[M]//Database Theory—ICDT’99．Springer Berlin Heidelberg,1999:217-235
[17] Hinneburg A,Aggarwal C C,Keim D A．What is the nearest neighbor in high dimensional spaces?[M]．Bibliothek der Universitt Konstanz,2000
[18] Francois D,Wertz V,Verleysen M．Non-euclidean metrics forsimilarity search in noisy datasets[C]∥Proc．of ESANN．2005
[19] Hsu C M,Chen M S．On the design and applicability of distance functions in high-dimensional data space[J]． IEEE Transactions on Knowledge and Data Engineering,2009,21(4):523-536
[20] Aggarwal C C,Hinneburg A,Keim D A．On the surprising behavior of distance metrics in high dimensional spaces[C]∥Proceedings of the 8th International Conference on Database Theory,ser．ICDT ’01．London,UK:Springer-Verlag,2001:420-434
[21] Canal L．A normal approximation for the chi-square distribution[J]．Computational Statistics & Data Analysis,2005,48(4):803-808
[22] Katafygiotis L S,Zuev K M．Geometric insight into the challenges of solving high-dimensional reliability problems[J]．Probabilistic Engineering Mechanics,2008,23(2):208-218
[23] Wang Jian-zhong．Geometric Structure of High-Dimensional Data and Dimensionality Reduction[C]∥Higher Education Press (China) and Springer．Beijing,2011
[24] Hopcroft J,Kannan R．Computer Science Theory for the Information Age[M]．Spring,2012:7-27

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

Comments

Recommended 0

No Suggested Reading articles found!

Properties of High-dimensional Data Space and Metric Choice

PDF (PC)

Abstract

Cite this article

share this article

References

Related Articles 0

Metrics

Comments

Recommended 0