Computer Science, 2021, Vol. 48, Issue (6A): 342-348. doi: 10.11896/jsjkx.201000053

• Intelligent Computing •

  • Corresponding author: LI Yan (ly@hbu.edu.cn)

Attribute Reduction Method Based on k-prototypes Clustering and Rough Sets

LI Yan1,2, FAN Bin2, GUO Jie2, LIN Zi-yuan1, ZHAO Zhao1   

  1 School of Applied Mathematics, Beijing Normal University, Zhuhai, Zhuhai, Guangdong 519087, China
  2 College of Mathematics and Information Science, Hebei University, Baoding, Hebei 071002, China
  • Online: 2021-06-10 Published: 2021-06-17
  • About author: LI Yan, born in 1976, Ph.D., professor, master's supervisor, is a member of China Computer Federation. Her main research interests include granular computing, knowledge discovery, and machine learning.
  • Supported by:
    NSF of Guangdong Province (2018A0303130026), NSF of Hebei Province (F2018201096), National Natural Science Foundation of China (61976141), and Key Science and Technology Foundation of the Educational Department of Hebei Province (ZD2019021).

Abstract: For target information systems containing both continuous and symbolic values, a novel attribute reduction method suitable for hybrid data is proposed, based on k-prototypes clustering and rough set theory under equivalence relations. First, k-prototypes clustering is applied to obtain clusters of the information system by defining a distance for hybrid data, forming a partition of the universe. The obtained clusters then replace the equivalence classes of rough set theory, and the concepts of cluster-based approximation set, positive region, and positive-region reduction are proposed accordingly. An attribute importance measure is also defined based on information entropy and the clusters. Finally, a variable precision positive-region reduction method is established, which can process both numerical and symbolic data, remove redundant attributes, reduce storage and running-time costs, and improve the performance of classification algorithms. In addition, partitions of the universe at different granularities can be obtained by adjusting the clustering parameter k, so that the resulting reduct can be optimized. Extensive experiments are carried out on 11 UCI data sets using four common classification algorithms. The classification accuracy before and after reduction is compared, and the influence of the parameters on the results is analyzed in detail, which verifies the effectiveness of the reduction method.
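
The two building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the sketch assumes the standard k-prototypes dissimilarity (squared Euclidean distance on numeric attributes plus a weight `gamma` times the number of categorical mismatches) and a Ziarko-style inclusion threshold `beta` for the variable precision positive region. The function names and the parameters `gamma` and `beta` are hypothetical.

```python
from collections import defaultdict


def mixed_distance(x, prototype, numeric_idx, categorical_idx, gamma=1.0):
    """Assumed k-prototypes dissimilarity between an object and a prototype:
    squared Euclidean distance on numeric attributes plus gamma times the
    number of mismatching categorical attributes."""
    numeric = sum((x[i] - prototype[i]) ** 2 for i in numeric_idx)
    mismatches = sum(x[i] != prototype[i] for i in categorical_idx)
    return numeric + gamma * mismatches


def cluster_positive_region(cluster_labels, decisions, beta=1.0):
    """Assumed cluster-based variable precision positive region: the sorted
    indices of all objects in clusters whose majority decision class covers
    at least a fraction beta of the cluster."""
    members = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        members[label].append(idx)

    positive = []
    for idxs in members.values():
        counts = defaultdict(int)
        for i in idxs:
            counts[decisions[i]] += 1
        # Inclusion degree of the cluster in its best-matching decision class.
        if max(counts.values()) / len(idxs) >= beta:
            positive.extend(idxs)
    return sorted(positive)


# Toy data: two numeric attributes and one symbolic attribute.
print(mixed_distance((1.0, 2.0, "a"), (0.0, 0.0, "b"), [0, 1], [2], gamma=0.5))  # 5.5

labels = [0, 0, 0, 1, 1]            # cluster index of each object
decisions = ["y", "y", "n", "n", "n"]  # decision attribute values
print(cluster_positive_region(labels, decisions, beta=0.6))  # [0, 1, 2, 3, 4]
print(cluster_positive_region(labels, decisions, beta=0.8))  # [3, 4]
```

In this toy example, raising `beta` from 0.6 to 0.8 excludes the mixed cluster from the positive region. Changing k would change `cluster_labels` and hence the granularity of the partition on which the positive region is computed.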

Key words: k-prototypes clustering, Attribute reduction, Hybrid data, Multi-granule, Rough set

CLC Number:

  • TP181