Computer Science ›› 2020, Vol. 47 ›› Issue (8): 178-184. doi: 10.11896/jsjkx.190700089

• Database & Big Data & Data Science •

K-Nearest Neighbor Classification Training Set Optimization Method Based on Genetic Instance and Feature Selection

DONG Ming-gang1,2, HUANG Yu-yang1, JING Chao1,2

  1 College of Information Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004, China
    2 Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin, Guangxi 541004, China
  • Online: 2020-08-15  Published: 2020-08-10
  • Corresponding author: JING Chao (jingchao@glut.edu.cn)
  • About author: DONG Ming-gang (d2015mg@qq.com), born in 1977, Ph.D, professor, is a member of China Computer Federation. His main research interests include intelligent computing and its applications, and machine learning.
    JING Chao, born in 1983, Ph.D, associate professor, is a member of China Computer Federation. His main research interests are intelligent computing, deep reinforcement learning and optimization approaches.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61563012, 61802085, 61203109), Guangxi Natural Science Foundation (2014GXNSFAA118371, 2015GXNSFBA139260) and Guangxi Key Laboratory of Embedded Technology and Intelligent Systems (2018A-04).

Abstract: The classification performance of K-Nearest Neighbor depends on the quality of the training set, so designing an efficient training set optimization algorithm is of practical significance. Traditional evolutionary training set optimization algorithms suffer from two major drawbacks: low efficiency and the mistaken removal of non-noise samples and features. To address these issues, this paper proposes a genetic training set optimization algorithm. The algorithm uses an efficient genetic algorithm based on the maximum Hamming distance: each crossover preserves the parents and generates two new children with the maximum Hamming distance, which both improves efficiency and maintains population diversity. The proposed algorithm combines a local noise-sample deletion strategy with a feature selection strategy. First, a decision tree is used to determine the range in which noise samples occur; then the genetic algorithm precisely removes the noise samples within this range and the noisy features globally, which reduces the risk of mistaken deletion and improves efficiency. Finally, a validation set selection strategy based on the nearest neighbor rule (1NN) is used to further improve the instance and feature selection accuracy of the genetic algorithm. On 15 standard datasets, the proposed algorithm improves classification accuracy by an average of 2.18%, 2.06%, 5.61%, 4.06% and 4.00% over the co-evolutionary instance and feature selection algorithm (IFS-CoCo), the weighted co-evolutionary instance and feature selection algorithm (CIW-NN), the evolutionary feature selection algorithm (EIS-RFS), the evolutionary instance selection algorithm (PS-NN), and traditional KNN, respectively. The experimental results suggest that the proposed method achieves higher classification accuracy and optimization efficiency than current evolutionary training set optimization algorithms.
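
To make the pipeline concrete, the following is a minimal sketch in Python (NumPy and scikit-learn) of one plausible reading of the three components above: a decision tree that localizes suspected noise samples, a genetic algorithm whose crossover preserves both parents and emits two children at maximum Hamming distance from each other, and a 1NN fitness evaluated on a validation set. The dataset (iris), the depth-3 tree, the "misclassified by the tree" suspect test, the chromosome layout, and the population/generation sizes are illustrative assumptions, not details taken from the paper.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Illustrative data; the paper evaluates on 15 standard datasets.
X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
n_feat = X_tr.shape[1]

# Step 1: localize suspected noise samples with a decision tree.
# The depth cap and the misclassification test are assumptions; the
# point is that only suspects are deletion candidates for the GA.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
suspects = np.where(tree.predict(X_tr) != y_tr)[0]

def fitness(chrom):
    """1NN accuracy on the validation set for one chromosome.

    Assumed layout: the first len(suspects) bits keep/drop the suspected
    noise instances (all other instances are always kept); the remaining
    n_feat bits select features.
    """
    keep = np.ones(len(X_tr), dtype=bool)
    keep[suspects[chrom[:len(suspects)] == 0]] = False
    feats = chrom[len(suspects):].astype(bool)
    if not feats.any() or keep.sum() < 1:
        return 0.0
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_tr[keep][:, feats], y_tr[keep])
    return knn.score(X_val[:, feats], y_val)

def max_hamming_crossover(p1, p2):
    """Offspring that disagree at every locus where the parents disagree,
    maximizing the Hamming distance between the two children."""
    diff = p1 != p2
    c1, c2 = p1.copy(), p2.copy()
    coin = rng.integers(0, 2, diff.sum()).astype(bool)
    c1[diff] = np.where(coin, p1[diff], p2[diff])
    c2[diff] = 1 - c1[diff]
    return c1, c2

# Step 2: a small elitist GA; parents survive alongside their offspring.
POP, GENS = 20, 30                      # illustrative sizes
L = len(suspects) + n_feat
pop = rng.integers(0, 2, size=(POP, L))
for _ in range(GENS):
    i, j = rng.choice(len(pop), size=2, replace=False)
    c1, c2 = max_hamming_crossover(pop[i], pop[j])
    pop = np.vstack([pop, c1, c2])      # parents are preserved
    scores = np.array([fitness(ch) for ch in pop])
    pop = pop[np.argsort(scores)[::-1][:POP]]

print("validation accuracy of best individual:", fitness(pop[0]))

The crossover here makes the two offspring disagree at every locus where the parents disagree, the largest Hamming distance attainable when each gene must be inherited from one of the parents; it is close in spirit to the HUX operator of CHC [21], which instead exchanges exactly half of the differing bits.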

Key words: Decision tree, Feature selection, Genetic algorithm, Instance selection, K-nearest neighbor, Noise sample

CLC Number: TP181

References
[1]COVER T, HART P.Nearest neighbor pattern classification[J].IEEE Transactions on Information Theory, 1967, 13(1):21-27.
[2]ZHAN Y, DAI S, MAO Q, et al.A Video Semantic Analysis Method Based on Kernel Discriminative Sparse Representation and Weighted KNN[J].Computer Journal, 2018, 58(6):1360-1372.
[3]WANG Y, YANG Y W.KNN Similarity Graph Algorithm Based on Heap and Neighborhood Coexistence[J].Computer Science, 2018, 45(5):196-200, 227.
[4]FENG G L, ZHOU W G.Spark-based Parallel Outlier Detection Algorithm of K-nearest Neighbor[J].Computer Science, 2018, 45(S2):349-352, 366.
[5]DENG Z, ZHU X, CHENG D, et al.Efficient kNN classification algorithm for big data[J].Neurocomputing, 2016, 195(C):143-148.
[6]ZHANG S, LI X, MING Z, et al.Efficient kNN Classification With Different Numbers of Nearest Neighbors[J].IEEE Transactions on Neural Networks & Learning Systems, 2017, PP(99):1-12.
[7]ZHANG S, LI X, ZONG M, et al.Learning k for kNN Classification[J].ACM Transactions on Intelligent Systems & Technology, 2017, 8(3):43.
[8]GIL-PITA R, YAO X.Evolving edited k-nearest neighbor classifiers[J].International Journal of Neural Systems, 2009, 18(6):459-467.
[9]XIE H, LIANG D, ZHANG Z, et al.A Novel Pre-Classification Based kNN Algorithm[C]∥IEEE International Conference on Data Mining Workshops.New Orleans, USA, 2017:1269-1275.
[10]LIU H, MOTODA H.Instance Selection and Construction for Data Mining[M].Springer International, 2001:448-454.
[11]WETTSCHERECK D, AHA D W, MOHRI T.A Review and Empirical Evaluation of Feature Weighting Methods for a Class of Lazy Learning Algorithms[J].Artificial Intelligence Review, 1997, 11(1-5):273-314.
[12]EIBEN A E, SCHOENAUER M.Evolutionary computing[J].Information Processing Letters, 2002, 82(1):1-6.
[13]ACAMPORA G, TORTORA G, VITIELLO A.Applying SPEA2 to prototype selection for nearest neighbor classification[C]∥IEEE International Conference on Systems, Man, and Cybernetics.Montreal, Canada, 2017:003924-003929.
[14]SASIKALA S, APPAVU A B S, GEETHA S.A novel adaptive feature selector for supervised classification[J].Information Processing Letters, 2017, 117:25-34.
[15]KHIABANI A, SABBAGHI A.PHGA:Proposed hybrid genetic algorithm for feature selection in binary classification[C]∥International Conference on Information and Knowledge Technology.Tehran, Iran, 2017:147-154.
[16]DERRAC J, GARCÍA S, HERRERA F.IFS-CoCo:Instance and Feature Selection Based on Cooperative Coevolution With Nearest Neighbor Rule[J].Pattern Recognition, 2010, 43(6):2082-2105.
[17]DERRAC J, TRIGUERO I, GARCÍA S, et al.Integrating instance selection, instance weighting, and feature weighting for nearest neighbor classifiers by coevolutionary algorithms[J].IEEE Transactions on Systems Man & Cybernetics Part B, 2012, 42(5):1383-1397.
[18]DERRAC J, CORNELIS C, GARCÍA S, et al.Enhancing evolutionary instance selection algorithms by means of fuzzy rough set based feature selection[J].Information Sciences, 2012, 186(1):73-92.
[19]DERRAC J, VERBIEST N, GARCÍA S, et al.On the use of evolutionary feature selection for improving fuzzy rough set based prototype selection[J].Soft Computing, 2013, 17(2):223-238.
[20]BRAHIM A B, LIMAM M.A hybrid feature selection method based on instance learning and cooperative subset search[J].Pattern Recognition Letters, 2016, 69:28-34.
[21]ESHELMAN L J.The CHC Adaptive Search Algorithm:How to Have Safe Search When Engaging in Nontraditional Genetic Recombination[J].Foundations of Genetic Algorithms, 1991, 1:265-283.
[22]GOLDBERG D E, SASTRY K.A Practical Schema Theorem for Genetic Algorithm Design and Tuning[C]∥Genetic and Evolutionary Computation Conference.San Francisco, USA, 2001:328-335.
[23]HUANG Y Y, DONG M G, JING C.Genetic instance selection algorithm for k-nearest neighbor classifier[J].Journal of Computer Applications, 2018, 38(11):3112-3118.