Computer Science ›› 2020, Vol. 47 ›› Issue (11A): 402-408. doi: 10.11896/jsjkx.191100094

• Big Data & Data Science •

Research on Training Sample Data Selection Methods

ZHOU Yu, REN Qin-chai, NIU Hui-bin   

  1. School of Electric Power, North China University of Water Resources and Electric Power, Zhengzhou 450011, China
  • Online: 2020-11-15  Published: 2020-11-17
  • About author: ZHOU Yu, born in 1979, Ph.D, associate professor. His main research interests include intelligent computing and intelligent control.
  • Supported by:
    This work was supported by the Project of Training Young Backbone Teachers in Colleges and Universities of Henan Province (2018GGJS079) and the National Natural Science Foundation of China (U1504622, 31671580).

Abstract: Machine learning, an important tool in data mining, not only explores the cognitive learning process of human beings but also encompasses the analysis and processing of data. Faced with the challenge of massive data, some current research focuses on improving and developing machine learning algorithms, while other work focuses on selecting sample data and reducing data sets; the two lines of research proceed in parallel. The selection of training sample data is a research hotspot in machine learning: by selecting sample data effectively, extracting the more informative samples, and eliminating redundant samples and noisy data, the quality of the training set is improved and better learning performance is obtained. This paper reviews existing sample data selection methods, organizing them into four categories: sampling-based methods, cluster-based methods, nearest-neighbor-classification-rule-based methods, and other related data selection methods. These categories are compared and analyzed, and conclusions and prospects are put forward concerning the problems in current training sample selection methods and directions for future research.
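
To make the nearest-neighbor-rule family of methods concrete, the sketch below is a minimal Python/NumPy implementation of Hart's condensed nearest neighbor (CNN) rule, one of the classic selection schemes this survey covers: it greedily keeps only the samples that the currently retained "store" misclassifies, so the selected subset remains consistent with the full training set under 1-NN classification. The function name and NumPy usage are illustrative choices, not the authors' code.

    import numpy as np

    def condensed_nearest_neighbor(X, y, seed=0):
        # Hart's CNN rule: build a store S such that every sample in (X, y)
        # is classified correctly by its 1-nearest neighbor in S.
        rng = np.random.default_rng(seed)
        order = rng.permutation(len(X))
        store = [order[0]]                 # seed the store with one sample
        changed = True
        while changed:                     # repeat until a full pass adds nothing
            changed = False
            for i in order:
                if i in store:
                    continue
                # 1-NN prediction of sample i using only the current store
                dists = np.linalg.norm(X[store] - X[i], axis=1)
                nearest = store[int(np.argmin(dists))]
                if y[nearest] != y[i]:     # misclassified by the store:
                    store.append(i)        # keep it as a new prototype
                    changed = True
        return np.asarray(store)           # indices of the selected subset

On well-separated classes the store typically ends up much smaller than the original set, which is exactly the redundancy elimination the abstract describes; noisy points, however, tend to be retained, which is why edited-nearest-neighbor variants are surveyed alongside the condensed rule.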

Key words: Data selection, Machine learning, Neural networks, Support vector machines, Training sample

CLC Number: TP181