Computer Science (计算机科学), 2023, Vol. 50, Issue (10): 37-47. doi: 10.11896/jsjkx.230600038

• Granular Computing and Knowledge Discovery •

Feature Selection Algorithm Based on Rough Set and Density Peak Clustering

CAO Dongtao1, SHU Wenhao1, QIAN Jin2

  1. 1 School of Information Engineering, East China Jiaotong University, Nanchang 330013, China
    2 School of Software, East China Jiaotong University, Nanchang 330013, China
  • Received: 2023-06-04 Revised: 2023-07-28 Online: 2023-10-10 Published: 2023-10-10
  • Corresponding author: SHU Wenhao (shuwenhao@126.com)
  • Author e-mail: (1767831966@qq.com)
  • About author: CAO Dongtao, born in 1997, master. His main research interests include machine learning, data mining, rough set, etc. SHU Wenhao, born in 1985, Ph.D, associate professor, master supervisor. Her main research interests include data mining, knowledge discovery, rough set, etc.
  • Supported by:
    National Natural Science Foundation of China (62266018, 61966016), Jiangxi Province Natural Science Foundation (20202BABL202037, 20232ACB202013, 20232BAB202052) and Jiangxi Postgraduate Innovation Fund Project (YC2022-s547).

Abstract: Feature selection can effectively remove redundant and irrelevant features from high-dimensional data while retaining important ones, thereby reducing the computational complexity of a model and improving its accuracy. To deal with noisy data that may degrade classification performance during feature selection, such as outliers and boundary points, a feature selection method based on rough sets and density peak clustering is proposed. First, noisy data are removed by the density peak clustering method and the cluster centers are picked out. Then, drawing on the idea of rough set theory, the data are partitioned by cluster center, and a feature importance evaluation measure is defined under the assumption that points in the same cluster should share the same label. Finally, a heuristic feature selection algorithm is designed to pick out the feature subset that yields a purer cluster structure. Comparative experiments on classification accuracy, number of selected features and running time are conducted against other algorithms on six UCI datasets, and the experimental results verify the effectiveness and efficiency of the proposed algorithm.

Key words: Feature selection, High-dimensional data, Noisy data, Rough sets, Density peak clustering
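
The clustering step summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes the standard cutoff-kernel quantities of density peak clustering (local density rho and distance delta to the nearest denser point), and it uses a plain majority-label purity score as a stand-in for the paper's feature importance measure, which the abstract does not specify. The function names, the cutoff `d_c`, the number of centers `k` and the toy data are all hypothetical, and the noise-removal step is omitted.

```python
import math
from collections import Counter

# Toy data (hypothetical): two well-separated groups of five 2-D points.
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
LABELS = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def density_peaks(points, d_c):
    """Compute rho, delta and the nearest denser neighbour of every point."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho[i]: number of other points closer to i than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Visit points by decreasing density; equal densities are ordered by
    # index (an assumption), so "denser" means "earlier in this order".
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    nearest_higher = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            # Convention for the densest point: its largest distance.
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i], nearest_higher[i] = dist[i][j], j
    return rho, delta, nearest_higher, order


def select_centers(rho, delta, order, k):
    """Pick the k points with the largest gamma = rho * delta (k >= 1)."""
    # sorted() is stable, so ties keep the density order and the densest
    # point is always selected first.
    return sorted(order, key=lambda i: -rho[i] * delta[i])[:k]


def assign_clusters(order, nearest_higher, centers):
    """Give each non-center point the cluster of its nearest denser point."""
    cluster = {c: c for c in centers}
    for i in order:
        if i not in cluster:
            cluster[i] = cluster[nearest_higher[i]]
    return [cluster[i] for i in range(len(order))]


def cluster_purity(clusters, labels):
    """Fraction of points whose label is the majority label of their cluster."""
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(labels)


rho, delta, nearest_higher, order = density_peaks(POINTS, d_c=1.2)
centers = select_centers(rho, delta, order, k=2)
clusters = assign_clusters(order, nearest_higher, centers)
purity = cluster_purity(clusters, LABELS)
```

Under these assumptions, a candidate feature subset could be scored by projecting the points onto that subset before calling `density_peaks`, and a greedy search could keep the feature whose addition raises the purity most.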

CLC Number:

  • TP391