Computer Science (计算机科学) ›› 2021, Vol. 48 ›› Issue (4): 111-116. doi: 10.11896/jsjkx.200800011

• Database & Big Data & Data Science •

Fast Symbolic Data Clustering Algorithm Based on Symbolic Relation Graph
(一种基于符号关系图的快速符号数据聚类算法)

ZHANG Yan-jin (张岩金)1, BAI Liang (白亮)1,2

  1. 1 School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
    2 Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China
  • Received: 2020-06-24 Revised: 2020-08-05 Online: 2021-04-15 Published: 2021-04-09
  • Corresponding author: BAI Liang (bailiang@sxu.edu.cn)
  • About author: ZHANG Yan-jin, born in 1995, postgraduate. Her main research interests include categorical data clustering. (zhang17836204220@163.com)
    BAI Liang, born in 1982, Ph.D., professor, is a member of China Computer Federation. His main research interests include cluster analysis.
  • Supported by:
    National Natural Science Foundation of China (61773247, 61876103) and Technology Research Development Projects of Shanxi (201901D211192).

Abstract: Because a large amount of symbolic data is generated in practical applications, symbolic data clustering has become an important research area of cluster analysis. Many symbolic data clustering algorithms have been proposed, but when they are applied in big data environments they still suffer from high computational cost and slow running speed. This paper proposes a fast symbolic data clustering algorithm based on a symbolic relation graph. The algorithm replaces the original data with a symbolic relation graph, which reduces the size of the data set and effectively addresses this problem. Extensive experimental analysis shows that the new algorithm is effective compared with other algorithms.

Key words: Clustering, Data mining, Relation graph, Similarity measure, Symbolic data
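The abstract does not say how the symbolic relation graph is built, so the following is only a minimal sketch of the general idea of swapping raw records for a smaller graph summary. It assumes, purely for illustration, that nodes are (attribute index, value) symbols and that an edge weight counts the records in which two symbols co-occur. The function name `build_symbol_graph` and the toy records are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def build_symbol_graph(records):
    """Summarize categorical records as a weighted co-occurrence graph.

    Assumed construction (not the paper's definition): each node is an
    (attribute_index, value) symbol, and each edge weight is the number
    of records that contain both symbols.
    """
    node_counts = Counter()
    edge_counts = Counter()
    for record in records:
        # Tag every value with its attribute index so equal strings in
        # different attributes remain distinct symbols.
        symbols = [(j, value) for j, value in enumerate(record)]
        node_counts.update(symbols)
        # Count every pair of symbols appearing together in this record.
        edge_counts.update(combinations(symbols, 2))
    return node_counts, edge_counts


# Toy data set: six records, three categorical attributes.
records = [
    ("red", "small", "round"),
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "square"),
    ("blue", "small", "square"),
]

nodes, edges = build_symbol_graph(records)
print(len(records), "records ->", len(nodes), "symbols,", len(edges), "edges")
```

Under this assumed construction the number of nodes is bounded by the number of distinct attribute values, not by the number of records, which is one way a graph summary can shrink the input to a later clustering step.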

CLC Number:

  • TP391