Computer Science, 2015, Vol. 42, Issue (Z6): 491-499.

• Data Mining •

General Overview on Clustering Algorithms

WU Yu-hong

  1. College of Mobile Telecommunications, Chongqing University of Posts and Telecommunications, Chongqing 401539, China
  • Online: 2018-11-14 Published: 2018-11-14

Abstract: Data mining techniques can discover latent, valuable knowledge in large volumes of data, giving new meaning to the massive data accumulated in the information age. With the rapid development of data mining, grid clustering, one of its important components, has been widely applied in fields such as pattern recognition, data analysis, image processing, and market research, and research on grid clustering algorithms has become a highly active topic in the field. This paper introduces the theory of data mining and analyzes grid clustering algorithms in depth. Building on a study of traditional grid clustering algorithms, it proposes several improved grid clustering algorithms that achieve better clustering quality and efficiency than the traditional ones. Building on an analysis of traditional multi-density clustering algorithms, it proposes a grid-based clustering algorithm for multi-density datasets (GDD) [1]. GDD is a multi-stage clustering method that combines grid-based clustering, a descending density threshold, and border point extraction to extract clusters of different densities, with manual intervention applied to the clustering results. The results show that GDD not only clusters the dataset correctly but also detects outliers effectively, which addresses two shortcomings: traditional multi-density clustering algorithms cannot effectively identify outliers and noise, and traditional grid algorithms can either cluster or find outliers but not both. GDD is more accurate than the traditional shared nearest neighbor (SNN) algorithm. It works well on uniform-density datasets and on most multi-density datasets, can discover clusters of arbitrary shape, and is insensitive to noise and to the input order of the data, but its results on a small number of multi-density datasets are unsatisfactory [1].

Key words: Grid clustering, Density threshold descending, Multi-stage clustering
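
The multi-stage idea described in the abstract (grid partitioning followed by clustering under a descending density threshold) can be illustrated with a minimal Python sketch. This is not the author's GDD algorithm: the function name `grid_multi_density`, the parameters `cell_size` and `thresholds`, and the use of 8-connected grid cells are all assumptions made for illustration, and border point extraction and manual intervention are omitted.

```python
from collections import Counter, deque
from itertools import product


def _cell(point, cell_size):
    """Return the integer grid coordinates of the cell containing a 2-D point."""
    return (int(point[0] // cell_size), int(point[1] // cell_size))


def grid_multi_density(points, cell_size, thresholds):
    """Illustrative grid clustering with descending density thresholds.

    Hypothetical helper, not the GDD algorithm from the paper.
    Returns (labels, outliers): `labels` maps each clustered cell to a
    cluster id; `outliers` lists points whose cell met no threshold.
    """
    # Density of a cell = number of points that fall inside it.
    counts = Counter(_cell(p, cell_size) for p in points)

    labels = {}
    next_id = 0
    # Stage by stage, from the highest threshold to the lowest.
    for threshold in sorted(thresholds, reverse=True):
        # Cells dense enough for this stage and not claimed by an earlier one.
        dense = {c for c, n in counts.items()
                 if n >= threshold and c not in labels}
        # Each 8-connected component of dense cells becomes one cluster.
        while dense:
            seed = dense.pop()
            labels[seed] = next_id
            queue = deque([seed])
            while queue:
                cx, cy = queue.popleft()
                for dx, dy in product((-1, 0, 1), repeat=2):
                    neighbor = (cx + dx, cy + dy)
                    if neighbor in dense:
                        dense.remove(neighbor)
                        labels[neighbor] = next_id
                        queue.append(neighbor)
            next_id += 1

    # Points in cells that never reached any threshold are treated as outliers.
    outliers = [p for p in points if _cell(p, cell_size) not in labels]
    return labels, outliers


# Toy data: a dense block (5 points per cell), a sparse block (2 points
# per cell), and one isolated point.
demo_points = []
for cx, cy in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    demo_points += [(cx + 0.1 * k + 0.05, cy + 0.5) for k in range(5)]
for cx, cy in [(5, 5), (5, 6), (6, 5)]:
    demo_points += [(cx + 0.25, cy + 0.5), (cx + 0.75, cy + 0.5)]
demo_points.append((10.5, 10.5))

labels, outliers = grid_multi_density(demo_points, cell_size=1.0, thresholds=[5, 2])
print(labels)
print(outliers)
```

In this toy run the high threshold picks out the dense block first, the lower threshold then picks out the sparse block, and the isolated point is left as an outlier. A single-threshold grid method would have to choose between missing the sparse block and accepting more noise, which is the trade-off the descending threshold is meant to ease.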

[1] Ma Gang, Li Zhigang. Principles and Applications of Data Warehouse and Data Mining [M]. Beijing: Higher Education Press, 2012: 20-42 (in Chinese)
[2] Chen Zhibo. Data Warehouse and Data Mining [M]. Beijing: Tsinghua University Press, 2011: 8-37 (in Chinese)
[3] Tan Pang-ning, Steinbach M, Kumar V. Introduction to Data Mining [M]. Fan Ming, trans. Beijing: Posts & Telecom Press, 2013: 6-53 (in Chinese)
[4] Dunham M H. Data Mining: Introductory and Advanced Topics [M]. Beijing: Tsinghua University Press, 2010: 23-60
[5] Ng R T, Han J. Efficient and effective clustering methods for spatial data mining [C] ∥ Proc of the 20th VLDB Conference. Santiago, Chile, 2010: 144-155
[6] Spivak G. Victory in Limbo: Imagism [C]. Nelson C, Grasberg L, eds. Urbana: University of Illinois Press, 2010: 271-313
[7] Zhang T, Ramakrishnan R, Livny M. An efficient data clustering method for very large databases [C] ∥ Proc of ACM SIGMOD International Conference on Management of Data. New York: ACM Press, 2012: 103-114
[8] Tan Pang-ning, Steinbach M. Introduction to Data Mining [M]. 2010: 372-373
[9] Chen Y, Tu L. Density-Based Clustering for Real-Time Stream Data [C] ∥ Proceedings of the 13th ACM SIGKDD International Conference of Knowledge Discovery and Data Mining. San Jose, California, USA, 2009: 133-142
[10] Cao Hongqi, Yu Lan, Sun Zhihui. Outlier mining algorithm based on grid clustering technique [J]. Computer Engineering, 2006(11): 18-96 (in Chinese)
[11] Sun Yufen. Research on grid-based clustering algorithms [D]. Wuhan: Huazhong University of Science and Technology, 2011 (in Chinese)
[12] Han J, Kamber M. Data Mining: Concepts and Techniques [J]. Morgan Kaufmann Publishers, 2011, 2(9): 33-82
[13] Chen Ming-yan, Han Jia-wei, Philip S Y. Data mining: an overview from a database perspective [J]. IEEE Trans on Knowledge and Data Eng., 1996, 8(6): 806-833
