计算机科学 ›› 2020, Vol. 47 ›› Issue (1): 87-95.doi: 10.11896/jsjkx.181202320

• 数据库&大数据&数据科学 • 上一篇    下一篇

基于差别矩阵和mRMR的分步优化特征选择算法

樊鑫1,陈红梅2   

  1. (西南交通大学信息科学与技术学院 成都611756)1;
    (西南交通大学云计算与智能技术高校重点实验室 成都611756)2
  • 收稿日期:2018-12-14 发布日期:2020-01-19
  • 通讯作者: 陈红梅(hmchen@swjtu.edu.cn)
  • 基金资助:
    国家自然科学基金(61572406)

Stepwise Optimized Feature Selection Algorithm Based on Discernibility Matrix and mRMR

FAN Xin1,CHEN Hong-mei2   

  1. (School of Information Science and Technology,Southwest Jiaotong University,Chengdu 611756,China)1;
    (Key Laboratory of Cloud Computing and Intelligent Technology,Southwest Jiaotong University,Chengdu 611756,China)2
  • Received:2018-12-14 Published:2020-01-19
  • About author:FAN Xin,born in 1994,postgraduate.His main research interests include the areas of data mining and pattern reco-gnition;CHEN Hong-mei,born in 1971,Ph.D,professor,Ph.D supervisor,is member of China Computer Federation (CCF).Her main research interests include rough set,granular computing and intelligent information processing.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61572406).

摘要: 分类问题普遍存在于现代工业生产中。在进行分类任务之前,利用特征选择筛选有用的信息,能够有效地提高分类效率和分类精度。最小冗余最大相关算法(mRMR)考虑最大化特征与类别的相关性和最小化特征之间的冗余性,能够有效地选择特征子集;但该算法存在中后期特征重要度偏差大以及无法直接给出特征子集的问题。针对该问题,文中提出了结合邻域粗糙集差别矩阵和mRMR原理的特征选择算法。根据最大相关性和最小冗余性原则,利用邻域熵和邻域互信息定义了特征的重要度,以更好地处理混合数据类型。基于差别矩阵定义了动态差别集,利用差别集的动态演化有效去除冗余属性,缩小搜索范围,优化特征子集,并根据差别矩阵判定迭代截止条件。实验选取SVM,J48,KNN和MLP作为分类器来评价该特征选择算法的性能。在公共数据集上的实验结果表明,与已有算法相比,所提算法的平均分类精度提升了2%左右,同时在特征较多的数据集上能够有效地缩短特征选择时间。所提算法继承了差别矩阵和mRMR的优点,能够有效地处理特征选择问题。

关键词: mRMR, 差别矩阵, 邻域粗糙集, 特征选择

Abstract: Classification is a common problem in modern industrial production.The classification efficiency and classification accuracy can be improved effectively by using feature selection to filter useful information before classifying tasks.Considering the maximum correlation between features and class and the minimum redundancy among features,the minimal-redundancy-maximal-relevance(mRMR) algorithm can effectively select feature subset.However,two problems exist in the algorithm mRMR,i.e.,the relatively large deviation of the importance of the features in the middle and later stages of mRMR,and the feature subset is not given directly.A novel algorithm based on the principle of mRMR and discernibility matrix of neighborhood rough set was proposed,to solve these problems.The significance of the feature is defined by employing neighborhood entropy and neighborhood mutual entropy based on the principle of the minimal redundancy and maximal relevance,which can deal the mixed type date better.Dynamic discernibility set is defined based on the discernibility matrix.The dynamic evolution of the discernibility set is utilized as the policy to delete redundant features and narrows search range.The optimized feature subset is given when the iteration is stop by the stop condition given by discernibility matrix.In this paper,SVM,J48,KNN and MLP were selected as classi-fiers to evaluate the performance of the feature selection algorithm.The experimental results on the public datasets show that the average classification accuracy of the proposed algorithm is about 2% more than that of previous algorithm,and the proposed algorithm can effectively shorten the feature selection time on the data set with more features.Therefore,the proposed algorithm inherits the advantages of discernibility matrix and MRMR,and can effectively deal with feature selection problems.

Key words: Discernibility matrix, Feature selection, mRMR, Neighborhood rough set

中图分类号: 

  • TP301.6
[1]CHANDRASHEKAR G,SAHIN F.A survey on feature selection methods[J].Computers and Electrical Engineering,2014,40(1):16-28.
[2]XE J Y,WANG M Z,ZHOU Y,et al.Differentially expressed gene selection algorithms for unbalanced gene datasets [J].Chinese Journal of Computers,2019,42(6):1232-1250.
[3]WANG D,NIE F,HUANG H.Feature Selection via Global Redundancy Minimization[J].IEEE Transactions on Knowledge and Data Engineering,2015,27(10):2743-2755.
[4]TASKIN G,KAYA H,BRUZZONE L.Feature selection based on high dimensional model representation for hyperspectral images[J].IEEE Transactions on Image Processing,2017,26(6):2918-2928.
[5]SHEIKHPOUR R,SARRAM M A,GHARAGHANI S,et al.A Survey on semi-supervised feature selection methods[J].Pattern Recognition,2016,64(C):141-158.
[6]MAFARJA M,MIRJALILI S.Whale optimization approaches for wrapper feature selection[J].Applied Soft Computing Journal,2018,62:441-453.
[7]VIOLA M,SANGIOVANNI M,TORALDO G,et al.A genera- lized eigenvalues classifier with embedded feature selection[J].Optimization Letters,2017,11(2):299-311.
[8]PENG H,LONG F,DING C.Feature selection based on mutual information:Criteria of Max-Dependency,Max-Relevance,and Min-Redundancy[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2005,27(8):1226-1238.
[9]ESTEVEZ P A,TESMER M,PEREZ C A,et al.Normalized mutual information feature selection[J].IEEE Transactions on Neural Networks,2009,20(2):189-201.
[10]SUN Z,ZHANG J,DAI L,et al.Neurocomputing Mutual information based multi-label feature selection via constrained convex optimization[J].Neurocomputing,2019,329:447-456.
[11]WANG J,WEI J M,YANG Z,et al.Feature Selection by Maximizing Independent Classification Information[J].IEEE Transactions on Knowledge & Data Engineering,2017,29(4):828-841.
[12]PAWLAK Z.Rough sets[J].International Journal of Computer & Information Sciences,1982,11(5):341-356.
[13]LIU Y,CHEN Y,TAN K,et al.Maximum relevance,minimum redundancy band selection based on neighborhood rough set for hyperspectral data classification[J].Measurement Science and Technology,2016,27(12):125501.
[14]YONG L,WENLIANG H,YUNLIANG J,et al.Quick attribute reduct algorithm for neighborhood rough set model[J].Information Sciences,2014,271:65-81.
[15]JIANG Y,YU Y.Minimal attribute reduction with rough set based on compactness discernibility information tree[J].Soft Computing,Springer Berlin Heidelberg,2016,20(6):2233-2243.
[16]DUBOIS D,PRADE H.Rough fuzzy sets and fuzzy rough sets[J].International Journal of General Systems,1990,17(2/3):191-209.
[17]Hu Q H,YU D R,XIE Z X.Numerical Attribute Reduction Based on Neighborhood Granulation and Rough Approximattion [J].Journal of Software,2008,19(3):640-649.
[18]SKOWRON A,RAUSZER C.The Discernibility Matrices and Functions in Information Systems[M]∥Intelligent Decision Support.Dordrecht:Springer,1992:331-362.
[19]DAI J.Rough set approach to incomplete numerical data[J].Information Sciences,2013,241:43-57.
[20]LI M,SHANG C,FENG S,et al.Quick attribute reduction in inconsistent decision tables[J].Information Sciences,2014,254:155-180.
[21]YANG M,YANG P.A novel condensing tree structure for rough set feature selection[J].Neurocomputing,2008,71(4/5/6):1092-1100.
[22]HU Q,ZHANG L,ZHANG D,et al.Measuring relevance between discrete and continuous features based on neighborhood mutual information[J].Expert Systems with Applications,2011,38(9):10737-10750.
[23]LIN Y,HU Q,LIU J,et al.Multi-label feature selection based on neighborhood mutual information[J].Applied Soft Computing,2016,38:244-256.
[24]HU Q,ZHANG L,ZHANG D,et al.Measuring relevance be- tween discrete and continuous features based on neighborhood mutual information[J].Expert Systems with Applications,2011,38(9):10737-10750.
[25]BERMEJO P,OSSA L DE,GAMEZ J A.Knowledge-Based Systems Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking[J].Knowledge-Based Systems,2012,25(1):35-44.
[26]HU Q,YU D,XIE Z.Neighborhood classifiers[J].Expert Systems with Applications,2008,34(2):866-876.
[1] 李斌, 万源.
基于相似度矩阵学习和矩阵校正的无监督多视角特征选择
Unsupervised Multi-view Feature Selection Based on Similarity Matrix Learning and Matrix Alignment
计算机科学, 2022, 49(8): 86-96. https://doi.org/10.11896/jsjkx.210700124
[2] 胡艳羽, 赵龙, 董祥军.
一种用于癌症分类的两阶段深度特征选择提取算法
Two-stage Deep Feature Selection Extraction Algorithm for Cancer Classification
计算机科学, 2022, 49(7): 73-78. https://doi.org/10.11896/jsjkx.210500092
[3] 康雁, 王海宁, 陶柳, 杨海潇, 杨学昆, 王飞, 李浩.
混合改进的花授粉算法与灰狼算法用于特征选择
Hybrid Improved Flower Pollination Algorithm and Gray Wolf Algorithm for Feature Selection
计算机科学, 2022, 49(6A): 125-132. https://doi.org/10.11896/jsjkx.210600135
[4] 陈于思, 艾志华, 张清华.
基于三角不等式判定和局部策略的高效邻域覆盖模型
Efficient Neighborhood Covering Model Based on Triangle Inequality Checkand Local Strategy
计算机科学, 2022, 49(5): 152-158. https://doi.org/10.11896/jsjkx.210300302
[5] 储安琪, 丁志军.
基于灰狼优化算法的信用评估样本均衡化与特征选择同步处理
Application of Gray Wolf Optimization Algorithm on Synchronous Processing of Sample Equalization and Feature Selection in Credit Evaluation
计算机科学, 2022, 49(4): 134-139. https://doi.org/10.11896/jsjkx.210300075
[6] 孙林, 黄苗苗, 徐久成.
基于邻域粗糙集和Relief的弱标记特征选择方法
Weak Label Feature Selection Method Based on Neighborhood Rough Sets and Relief
计算机科学, 2022, 49(4): 152-160. https://doi.org/10.11896/jsjkx.210300094
[7] 李宗然, 陈秀宏, 陆赟, 邵政毅.
鲁棒联合稀疏不相关回归
Robust Joint Sparse Uncorrelated Regression
计算机科学, 2022, 49(2): 191-197. https://doi.org/10.11896/jsjkx.210300034
[8] 张叶, 李志华, 王长杰.
基于核密度估计的轻量级物联网异常流量检测方法
Kernel Density Estimation-based Lightweight IoT Anomaly Traffic Detection Method
计算机科学, 2021, 48(9): 337-344. https://doi.org/10.11896/jsjkx.200600108
[9] 杨蕾, 降爱莲, 强彦.
基于自编码器和流形正则的结构保持无监督特征选择
Structure Preserving Unsupervised Feature Selection Based on Autoencoder and Manifold Regularization
计算机科学, 2021, 48(8): 53-59. https://doi.org/10.11896/jsjkx.200700211
[10] 侯春萍, 赵春月, 王致芃.
基于自反馈最优子类挖掘的视频异常检测算法
Video Abnormal Event Detection Algorithm Based on Self-feedback Optimal Subclass Mining
计算机科学, 2021, 48(7): 199-205. https://doi.org/10.11896/jsjkx.200800146
[11] 胡艳梅, 杨波, 多滨.
基于网络结构的正则化逻辑回归
Logistic Regression with Regularization Based on Network Structure
计算机科学, 2021, 48(7): 281-291. https://doi.org/10.11896/jsjkx.201100106
[12] 周钢, 郭福亮.
基于特征选择的高维数据集成学习方法研究
Research on Ensemble Learning Method Based on Feature Selection for High-dimensional Data
计算机科学, 2021, 48(6A): 250-254. https://doi.org/10.11896/jsjkx.200700102
[13] 丁思凡, 王锋, 魏巍.
一种基于标签相关度的Relief特征选择算法
Relief Feature Selection Algorithm Based on Label Correlation
计算机科学, 2021, 48(4): 91-96. https://doi.org/10.11896/jsjkx.200800025
[14] 滕俊元, 高猛, 郑小萌, 江云松.
噪声可容忍的软件缺陷预测特征选择方法
Noise Tolerable Feature Selection Method for Software Defect Prediction
计算机科学, 2021, 48(12): 131-139. https://doi.org/10.11896/jsjkx.201000168
[15] 张亚钏, 李浩, 宋晨明, 卜荣景, 王海宁, 康雁.
混合人工化学反应优化和狼群算法的特征选择
Hybrid Artificial Chemical Reaction Optimization with Wolf Colony Algorithm for Feature Selection
计算机科学, 2021, 48(11A): 93-101. https://doi.org/10.11896/jsjkx.210100067
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!