Double k-Nearest-Neighbor Clustering Algorithm for Mixed-Attribute Data Streams

Computer Science, 2013, Vol. 40, Issue 10: 226-230.


HUANG De-cai (黄德才), SHEN Xian-qiao (沈仙桥), LU Yi-hong (陆亿红)

College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023 (all three authors)

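Publication date: 2018-11-16 Release date: 2018-11-16
Funding:
This work was supported by the Ministry of Water Resources Special Fund for Public Welfare Industry Research (201001031), under the project "Research and Demonstration of Key Technologies for Rural Hydropower Benefit Analysis and Efficiency Enhancement".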

Double k-nearest Neighbors of Heterogeneous Data Stream Clustering Algorithm

HUANG De-cai, SHEN Xian-qiao and LU Yi-hong

Online: 2018-11-16 Published: 2018-11-16

Abstract

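Abstract (translated from the Chinese): Most existing data stream clustering algorithms handle only data with numerical attributes and cannot cope with data that contain both numerical and categorical attributes. Existing mixed-attribute data stream clustering algorithms also leave considerable room for improvement in how they standardize and cluster the data. To address this, a double k-nearest-neighbor clustering algorithm for mixed-attribute data streams is proposed. The algorithm adopts the online/offline framework of CluStream and introduces a three-step clustering scheme for mixed-attribute data streams. It first generates micro-clusters using double k-nearest neighbors and an improved dimension distance, then generates initial macro-clusters using a dynamic data standardization method and a mean-based cosine model, and finally optimizes the macro-clusters using the mean-based cosine model together with prior clustering results. Experimental results show that the proposed algorithm achieves good clustering quality and scalability.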

Keywords: data stream, mixed attributes, clustering, double k-nearest neighbors
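
The first step described in the abstract, forming micro-clusters online from points that carry both numerical and categorical attributes, can be illustrated with a minimal sketch. The abstract does not define the double k-nearest-neighbor rule or the improved dimension distance, so the code below does not reproduce them: `MicroCluster`, `mixed_distance` and `k_nearest_clusters` are hypothetical names, the distance is a generic stand-in, and numerical attributes are assumed to be already scaled to [0, 1].

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MicroCluster:
    """Additive summary of absorbed points, in the spirit of CluStream."""
    n: int = 0
    linear_sum: list = field(default_factory=list)  # per numerical attribute
    square_sum: list = field(default_factory=list)  # per numerical attribute
    cat_counts: list = field(default_factory=list)  # one Counter per categorical attribute

    def absorb(self, numeric, categorical):
        # Allocate the summaries lazily when the first point arrives.
        if self.n == 0:
            self.linear_sum = [0.0] * len(numeric)
            self.square_sum = [0.0] * len(numeric)
            self.cat_counts = [Counter() for _ in categorical]
        self.n += 1
        for i, x in enumerate(numeric):
            self.linear_sum[i] += x
            self.square_sum[i] += x * x
        for counter, value in zip(self.cat_counts, categorical):
            counter[value] += 1

    def centroid(self):
        return [s / self.n for s in self.linear_sum]


def mixed_distance(cluster, numeric, categorical):
    """Assumed stand-in distance (not the paper's): absolute gaps to the
    numerical centroid plus, per categorical attribute, the fraction of
    absorbed points with a different value, averaged over all attributes."""
    num_part = sum(abs(c - x) for c, x in zip(cluster.centroid(), numeric))
    cat_part = sum(1.0 - counter[v] / cluster.n
                   for counter, v in zip(cluster.cat_counts, categorical))
    return (num_part + cat_part) / (len(numeric) + len(categorical))


def k_nearest_clusters(clusters, numeric, categorical, k):
    """Plain k-NN lookup: the k micro-clusters closest to the incoming point."""
    return sorted(clusters,
                  key=lambda c: mixed_distance(c, numeric, categorical))[:k]
```

Because the summary is additive, a point can be absorbed in constant time per attribute, which is what makes this kind of structure suitable for the online phase of a stream algorithm.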

Abstract: Most existing data stream clustering algorithms can handle only data with numerical attributes and cannot cope with data containing both numerical and categorical attributes. In addition, existing heterogeneous data stream clustering algorithms leave considerable room for improvement in how they standardize and cluster data. A double k-nearest neighbors clustering algorithm for heterogeneous data streams is therefore proposed. The algorithm adopts CluStream's online and offline framework and introduces a three-step clustering scheme. First, it uses double k-nearest neighbors and an improved dimension distance to form micro-clusters. Second, it uses a dynamic data standardization method and a mean-based cosine model to form initial macro-clusters. Third, it uses the mean-based cosine model and prior clustering results to optimize the macro-clusters. Experimental results demonstrate that the proposed method improves clustering accuracy and scalability.

Key words: Data stream, Heterogeneous, Clustering, Double k-nearest neighbors
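
The second step in the abstract combines dynamic standardization with a mean-based cosine model. The abstract gives no formulas for either, so the following is only a sketch of the two generic building blocks under stated assumptions: `RunningStandardizer` (a hypothetical name) uses Welford's online mean and variance update as one possible way to standardize a stream without storing it, and `cosine_similarity` is the ordinary cosine between two vectors, such as the mean vectors of two clusters.

```python
import math


class RunningStandardizer:
    """Welford's online mean/variance for a single numerical attribute."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def transform(self, x):
        # With fewer than two points or zero spread there is no scale yet.
        if self.n < 2 or self.m2 == 0.0:
            return 0.0
        std = math.sqrt(self.m2 / self.n)  # population standard deviation
        return (x - self.mean) / std


def cosine_similarity(u, v):
    """Ordinary cosine similarity; returns 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

An online update of this kind lets the scaling follow the stream as its distribution drifts, at the cost of early values being standardized against statistics estimated from few points.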

HUANG De-cai, SHEN Xian-qiao and LU Yi-hong. Double k-nearest Neighbors of Heterogeneous Data Stream Clustering Algorithm[J]. Computer Science, 2013, 40(10): 226-230. https://doi.org/

