Computer Science ›› 2017, Vol. 44 ›› Issue (Z11): 442-447. doi: 10.11896/j.issn.1002-137X.2017.11A.094

• Big Data and Data Mining •

UID-DBSCAN Clustering Algorithm of Multi-dimensional Uncertain Data Based on Interval Number

WEI Fang-yuan, HUANG De-cai

  1. College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023
  • Online: 2018-12-01 Published: 2018-12-01
  • Supported by:
    This work was supported by the Special Scientific Research Fund for Public Welfare Industry of the Ministry of Water Resources (201401044)

Abstract: Clustering of uncertain data has attracted growing attention. Among existing methods, the UIDK-means and U-PAM algorithms inherit the defects of partition-based clustering: they cannot identify clusters of arbitrary shape and are sensitive to noise. The FDBSCAN algorithm assumes that the probability distribution function or probability density function of the uncertain data is known in advance, but this information is often hard to obtain in practice. To address these shortcomings, a clustering algorithm for multi-dimensional uncertain data based on interval numbers, UID-DBSCAN, is proposed. The algorithm represents uncertain data with interval numbers combined with statistical information about the data, and measures the similarity between uncertain data objects with an interval-number distance function of low computational complexity. It introduces, for the first time, the concepts of density, density-reachability, and density-connectivity for interval numbers and uses them to expand clusters. It also selects the density parameters adaptively from the statistical features of the data set, so that clustering runs automatically. Experimental results show that UID-DBSCAN identifies noise effectively, handles clusters of arbitrary shape, and achieves higher clustering accuracy with lower computational complexity.

Key words: Uncertain data, Interval number, Clustering algorithm, DBSCAN
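
The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea of density-based clustering over interval-valued objects, not the authors' UID-DBSCAN. Every name in it is hypothetical. Three points are assumptions made for illustration: the interval form `[mean - std, mean + std]`, the endpoint-wise Euclidean distance, and the fixed `eps` and `min_pts` values (the paper selects its density parameters adaptively, which this sketch does not attempt).

```python
import math
from statistics import mean, stdev


def to_interval(samples):
    """Summarize repeated measurements of one attribute as an interval.

    Assumption: the interval is [mean - std, mean + std]; the paper's
    actual construction is not stated in the abstract.
    """
    m = mean(samples)
    s = stdev(samples) if len(samples) > 1 else 0.0
    return (m - s, m + s)


def interval_distance(a, b):
    """Assumed distance: Euclidean distance over the lower and upper
    endpoints of every dimension (not necessarily the paper's function)."""
    return math.sqrt(sum((al - bl) ** 2 + (au - bu) ** 2
                         for (al, au), (bl, bu) in zip(a, b)))


def region_query(data, i, eps):
    """Indices of all objects within eps of object i (including i itself)."""
    return [j for j in range(len(data))
            if interval_distance(data[i], data[j]) <= eps]


def interval_dbscan(data, eps, min_pts):
    """Plain DBSCAN on interval objects.

    Returns one label per object: 0, 1, ... for clusters, -1 for noise.
    """
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(data)
    cluster = -1
    for i in range(len(data)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(data, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later become a border object
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:                   # expand the cluster from core objects
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border object: joined, not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(data, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)
    return labels


# Seven 2-D objects, each dimension given as a (lower, upper) interval.
data = [
    [(0.0, 0.2), (0.0, 0.2)],
    [(0.1, 0.3), (0.0, 0.2)],
    [(0.0, 0.2), (0.1, 0.3)],
    [(5.0, 5.2), (5.0, 5.2)],
    [(5.1, 5.3), (5.0, 5.2)],
    [(5.0, 5.2), (5.1, 5.3)],
    [(20.0, 20.4), (20.0, 20.4)],
]
labels = interval_dbscan(data, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

In this toy data set the first three and the next three objects form two dense groups, and the last object has no neighbors within `eps`, so it is labeled as noise. Because each pairwise comparison only touches the interval endpoints, the distance costs a constant amount of work per dimension, which is the kind of low-cost similarity measure the abstract refers to.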

