Computer Science (计算机科学) ›› 2020, Vol. 47 ›› Issue (11A): 402-408. doi: 10.11896/jsjkx.191100094
周玉, 任钦差, 牛会宾
ZHOU Yu, REN Qin-chai, NIU Hui-bin
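Abstract: As an important tool in data mining, machine learning is not only an exploration of the human cognitive learning process but also covers the analysis and processing of data. Faced with the challenge of massive data, some researchers focus on improving and developing machine learning algorithms, while others work on selecting sample data and reducing data sets; these two lines of research proceed in parallel. Training sample selection is an active research topic in machine learning: by effectively selecting sample data, extracting the more informative samples, and removing redundant samples and noisy data, it improves the quality of the training set and thereby yields better learning performance. This paper surveys existing sample data selection methods, summarizing and comparing them from four aspects: three main categories (sampling-based methods, clustering-based methods, and methods based on nearest-neighbor classification rules) plus other related data selection methods. It also summarizes the open problems of training sample selection methods and offers an outlook on future research directions.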
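As a rough illustration of what selection based on a nearest-neighbor rule can look like, the sketch below keeps only the samples that a 1-nearest-neighbor classifier, built from the subset selected so far, would misclassify. This is a condensing-style procedure written for illustration only; it is not taken from the paper's text. The function name `condense`, the toy data, and the choice of Euclidean distance are all assumptions.

```python
import math


def condense(samples, labels):
    """Return sorted indices of a reduced training subset.

    Illustrative condensing-style selection (an assumption, not the
    paper's own algorithm): a sample is added to the subset only if
    the 1-NN rule over the current subset assigns it the wrong label.
    """
    keep = [0]  # seed the subset with the first sample
    changed = True
    while changed:  # repeat passes until no sample is added
        changed = False
        for i, x in enumerate(samples):
            if i in keep:
                continue
            # Nearest already-kept sample under Euclidean distance.
            nearest = min(keep, key=lambda j: math.dist(x, samples[j]))
            if labels[nearest] != labels[i]:
                keep.append(i)  # misclassified, so it carries information
                changed = True
    return sorted(keep)


# Toy one-dimensional data: two well-separated classes.
points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
classes = [0, 0, 0, 1, 1, 1]
kept = condense(points, classes)
print(kept)  # [0, 3]
```

On this toy set one sample per class suffices to classify the rest correctly, so the other four are treated as redundant. Note that a procedure of this kind does not remove noisy samples; that requires a separate editing step.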