Computer Science ›› 2021, Vol. 48 ›› Issue (1): 136-144. doi: 10.11896/jsjkx.200700213

• Database & Big Data & Data Science •

Three-way Filtering Algorithm of Basic Clustering Based on Differential Measurement

LIANG Wei1,2, DUAN Xiao-dong1, XU Jian-feng1,3,4

  1. 1 School of Software, Nanchang University, Nanchang 330047, China
    2 School of Software Engineering, South China University of Technology, Guangzhou 510006, China
    3 College of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
    4 Tellhow Software Co., Ltd., Nanchang 330096, China
  • Received: 2020-07-31 Revised: 2020-08-30 Online: 2021-01-15 Published: 2021-01-15
  • Corresponding author: XU Jian-feng (jianfeng_x@ncu.edu.cn)
  • Author contact: 416627317007@email.ncu.edu.cn
  • About author: LIANG Wei, born in 1993, Ph.D. candidate, is a student member of China Computer Federation. His main research interests include machine learning, granular computing, three-way decision and ensemble clustering.
    XU Jian-feng, born in 1973, Ph.D candidate, professor, is a member of China Computer Federation. His main research interests include data mining, rough set, three-way decision and machine learning.
  • Supported by:
    National Natural Science Foundation of China (61763031) and Jiangxi Provincial Natural Science Foundation (20202BAB202018).

Abstract: Preprocessing the basic clustering members is an important step in ensemble clustering algorithms. Numerous studies have shown that the diversity of the set of basic clustering members affects the performance of ensemble clustering. Current research on ensemble clustering centers on generating basic clusterings and optimizing the integration strategy, while the measurement and optimization of the diversity of basic clustering members remain underdeveloped. Based on Jaccard similarity, this paper proposes a differential (diversity) metric for basic clustering members and, by introducing the idea of three-way decisions, constructs a differential three-way filtering method for basic clustering members. The method first sets the initial three-way decision thresholds α(0) and β(0), then calculates the differential metric of each basic clustering member and makes a three-way decision. The decision strategy is as follows: when the differential metric of a basic clustering member is less than the threshold α(0), the member is deleted; when it is greater than the threshold β(0), the member is retained; and when it is greater than α(0) and less than β(0), the member is placed in the boundary region of the three-way decision to await further judgment. After each round of three-way decisions, the algorithm recalculates the thresholds α(1) and β(1) and applies the three-way decision again to the boundary region left by the previous round, until no basic clustering member falls into the boundary region or the specified number of iterations is reached. Comparative experiments show that the proposed three-way filtering method for basic clustering based on differential measurement can effectively improve the performance of ensemble clustering.

Key words: Basic clustering filtering, Clustering ensemble, Differential measurement, Three-way decision, Three-way optimization
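
The filtering loop described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact differential metric or the rule for recalculating the thresholds, so both are assumptions here: the metric is taken to be one minus the mean Jaccard similarity between a member's co-clustered sample pairs and those of every other member, and the thresholds are moved toward their midpoint after each round. The names `pair_set`, `diversity_scores`, `three_way_filter`, `max_rounds`, and `shrink` are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pair_set(labels):
    """Index pairs (i, j), i < j, that a partition places in the same cluster."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def diversity_scores(partitions):
    """ASSUMED metric: 1 - mean Jaccard similarity to all other members."""
    if len(partitions) < 2:
        raise ValueError("need at least two basic clustering members")
    pairs = [pair_set(p) for p in partitions]
    scores = []
    for i, own in enumerate(pairs):
        sims = [jaccard(own, other) for j, other in enumerate(pairs) if j != i]
        scores.append(1.0 - sum(sims) / len(sims))
    return scores


def three_way_filter(scores, alpha, beta, max_rounds=10, shrink=0.5):
    """Iterated three-way decision; returns (kept, deleted, boundary) indices."""
    kept, deleted = [], []
    boundary = list(range(len(scores)))
    for _ in range(max_rounds):
        if not boundary:
            break  # nothing left in the boundary region
        undecided = []
        for i in boundary:
            if scores[i] < alpha:
                deleted.append(i)      # too similar to the rest: delete
            elif scores[i] > beta:
                kept.append(i)         # sufficiently different: retain
            else:
                undecided.append(i)    # defer to the next round
        boundary = undecided
        # ASSUMED update rule (not from the paper): pull both thresholds
        # toward their midpoint so the boundary region narrows each round.
        mid = (alpha + beta) / 2
        alpha += shrink * (mid - alpha)
        beta -= shrink * (beta - mid)
    return kept, deleted, boundary
```

Members still in the boundary region when the round limit is reached are returned separately, since the abstract does not state how they are finally handled; a caller would have to choose whether to keep or drop them.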

CLC Number: 

  • TP18