Computer Science ›› 2020, Vol. 47 ›› Issue (2): 44-50. doi: 10.11896/jsjkx.181202285

• Database & Big Data & Data Science •

Feature Selection Method Based on Rough Sets and Improved Whale Optimization Algorithm

WANG Sheng-wu, CHEN Hong-mei

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China;
  2. Key Laboratory of Cloud Computing and Intelligent Technology, Southwest Jiaotong University, Chengdu 611756, China
  • Received: 2018-12-10 Online: 2020-02-15 Published: 2020-03-18
  • Corresponding author: CHEN Hong-mei (hmchen@swjtu.edu.cn)
  • About author: WANG Sheng-wu, born in 1995, postgraduate, is a member of China Computer Federation (CCF). His main research interests include cloud computing and intelligent technology. CHEN Hong-mei, born in 1971, Ph.D, professor, Ph.D supervisor, is a member of China Computer Federation (CCF). Her main research interests include granular computing, rough sets and intelligent information processing.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61572406).

Abstract: With the development of the Internet and Internet of Things technologies, data collection has become increasingly easy. However, high-dimensional data contain many redundant and irrelevant features; using them directly inflates the computational cost of a model and may even degrade its performance, so it is necessary to reduce the dimensionality of high-dimensional data. Feature selection reduces the computational cost and removes redundant features by reducing the feature dimension, thereby improving the performance of machine learning models, and it retains the original features of the data, which gives it good interpretability. It has become one of the important data preprocessing steps in machine learning. Rough set theory is an effective method for feature selection, as it preserves the characteristics of the original features while removing redundant information. However, traditional rough-set-based feature selection methods can hardly find the globally optimal feature subset, because the cost of evaluating all feature subset combinations is prohibitive. To overcome this problem, a feature selection method based on rough sets and an improved whale optimization algorithm was proposed. To keep the whale optimization algorithm from falling into local optima, the improved algorithm employs population optimization and disturbance strategies. The algorithm first randomly initializes a set of feature subsets, then evaluates each subset with an objective function based on rough-set attribute dependency, and finally applies the improved whale optimization algorithm to find an acceptable approximately optimal feature subset through iteration. Experimental results on UCI datasets show that, when a support vector machine is used as the classifier for evaluation, the proposed algorithm finds feature subsets with less information loss and achieves higher classification accuracy. Therefore, the proposed algorithm has a certain advantage in feature selection.
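The pipeline described above can be made concrete with a short sketch. The following Python code is an illustrative reconstruction, not the authors' implementation: it shows rough-set attribute dependency as the subset-quality measure and a binary whale optimization loop over feature masks. The objective weighting `alpha`, the sigmoid transfer function, and the re-seeding of the worst whale (a crude stand-in for the paper's population-optimization and disturbance strategies, whose details the abstract does not give) are all assumptions.

```python
import math
import random
from collections import defaultdict


def dependency(data, labels, subset):
    """Rough-set attribute dependency gamma_C(D) = |POS_C(D)| / |U|:
    the fraction of objects whose equivalence class under the selected
    (discrete) attributes is pure with respect to the decision label."""
    if not subset:
        return 0.0
    groups = defaultdict(list)
    for row, label in zip(data, labels):
        groups[tuple(row[j] for j in subset)].append(label)
    pos = sum(len(g) for g in groups.values() if len(set(g)) == 1)
    return pos / len(data)


def fitness(data, labels, mask, alpha=0.9):
    """Assumed objective (a common form in rough-set metaheuristic
    feature selection): reward dependency, mildly reward small subsets."""
    subset = [j for j, keep in enumerate(mask) if keep]
    n = len(mask)
    return (alpha * dependency(data, labels, subset)
            + (1 - alpha) * (n - len(subset)) / n)


def binary_woa(data, labels, n_whales=20, n_iter=100, seed=0):
    """Binary whale optimization: real positions in [0, 1] are pushed
    through a sigmoid transfer function to sample 0/1 feature masks."""
    rng = random.Random(seed)
    dim = len(data[0])
    sig = lambda x: 1.0 / (1.0 + math.exp(-10.0 * (x - 0.5)))
    to_mask = lambda pos: [rng.random() < sig(x) for x in pos]
    clip = lambda v: min(1.0, max(0.0, v))

    whales = [[rng.random() for _ in range(dim)] for _ in range(n_whales)]
    best_pos, best_fit = None, -1.0
    for t in range(n_iter):
        a = 2.0 - 2.0 * t / n_iter          # 'a' decreases linearly 2 -> 0
        fits = []
        for pos in whales:
            f = fitness(data, labels, to_mask(pos))
            fits.append(f)
            if f > best_fit:
                best_pos, best_fit = pos[:], f
        for pos in whales:
            A = 2.0 * a * rng.random() - a
            C = 2.0 * rng.random()
            if rng.random() < 0.5:
                # shrinking encirclement of the best whale (|A| < 1),
                # or exploratory move toward a random whale (|A| >= 1)
                ref = best_pos if abs(A) < 1 else rng.choice(whales)
                pos[:] = [clip(ref[j] - A * abs(C * ref[j] - pos[j]))
                          for j in range(dim)]
            else:
                # logarithmic-spiral update around the current best
                l = rng.uniform(-1.0, 1.0)
                pos[:] = [clip(abs(best_pos[j] - pos[j]) * math.exp(l)
                               * math.cos(2.0 * math.pi * l) + best_pos[j])
                          for j in range(dim)]
        # crude stand-in for the paper's disturbance strategy:
        # re-seed the worst whale to help escape local optima
        whales[fits.index(min(fits))] = [rng.random() for _ in range(dim)]
    return to_mask(best_pos), best_fit
```

Under the paper's protocol, the returned mask would then be evaluated by training a support vector machine on the selected features; the train/test split and SVM parameters are not specified in the abstract.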

Key words: Attribute dependency, Feature selection, Improved whale optimization algorithm, Optimal feature subset, Rough set theory

CLC Number: TP301.6