计算机科学 ›› 2024, Vol. 51 ›› Issue (7): 59-70.doi: 10.11896/jsjkx.230400143

• 数据库&大数据&数据科学 • 上一篇    下一篇

SVM样本约简算法研究综述

张代俐, 汪廷华, 朱兴淋   

  1. 赣南师范大学数学与计算机科学学院 江西 赣州 341000
  • 收稿日期:2023-04-24 修回日期:2023-09-06 出版日期:2024-07-15 发布日期:2024-07-10
  • 通讯作者: 汪廷华 (wthpku@163.com)
  • 作者简介:(17746670679@163.com)
  • 基金资助:
    国家自然科学基金(61966002);江西省研究生创新专项资金(YC2022-s944)

Overview of Sample Reduction Algorithms for Support Vector Machine

ZHANG Daili, WANG Tinghua, ZHU Xinglin   

  1. School of Mathematics and Computer Science,Gannan Normal University,Ganzhou,Jiangxi 341000,China
  • Received:2023-04-24 Revised:2023-09-06 Online:2024-07-15 Published:2024-07-10
  • About author:ZHANG Daili,born in 1996,postgra-duate.Her main research interests include machine learning and data mi-ning.
    WANG Tinghua,born in 1977,Ph.D,professor,is a senior member of CCF(No.95071S).His main research intere-sts include artificial intelligence and machine learning.
  • Supported by:
    National Natural Science Foundation of China(61966002) and Graduate Innovation Found of Jiangxi Province(YC2022-s944).

摘要: 支持向量机(Support Vector Machine,SVM)是基于统计学习理论和结构风险最小化原则发展起来的一种有监督的机器学习算法,它有效克服了局部最小和维数灾难等问题,具有良好的泛化性能,并被广泛应用于模式识别和人工智能领域。但SVM的学习效率随着训练样本数量的增加而显著降低,对于大规模训练集,采用标准优化方法的传统SVM面临着内存需求过大、执行速度慢,有时甚至无法执行的问题。为了缓解SVM在大规模训练集上存储需求高、训练时间长等问题,学者们提出了SVM样本约简算法。文中首先介绍了SVM理论基础,然后从基于聚类、几何分析、主动学习、增量学习和随机抽样5个方面系统综述了SVM样本约简算法的研究现状,讨论了各种SVM样本约简算法的优缺点,最后总结全文并展望未来。

关键词: 支持向量机, 大规模数据集, 样本约简, 机器学习, 分类

Abstract: Support vector machine(SVM) is a supervised machine learning algorithm developed based on statistical learning theory and the principle of structural risk minimization,which effectively overcomes the problems of local minimum and curse of dimensionality and has good generalization performance.SVM has been widely used in the fields of pattern recognition and artificial intelligence.However,the learning efficiency of SVM decreases significantly with the increase of the number of training samples.For large-scale training datasets,the traditional SVM with standard optimization methods will be confronted with the problems of excessive memory requirements,slow training speed,and sometimes even being unable to execute.To alleviate the problems of high storage requirements and long training time of SVM on large-scale training sets,scholars have proposed SVM sample reduction algorithms.This paper firstly introduces the theoretical basis of the SVM and then systematically reviews the current research status of the SVM sample reduction algorithms from five aspects based on clustering,geometric analysis,active learning,incremental learning and random sampling,respectively.And it discusses the advantages and disadvantages of these algorithms,and finally presents an outlook on the future research of the SVM sample reduction methods.

Key words: Support vector machine, Large-scale data set, Sample reduction, Machine learning, Classification

中图分类号: 

  • TP181
[1]RATHORE M M,SHAH S A,SHUKLA D,et al.The role of AI,machine learning,and big data in digital twinning:A syste-matic literature review,challenges,and opportunities[J].IEEE Access,2021,9:32030-32052.
[2]DEEPA N,PHAM Q V,NGUYEN D C,et al.A survey onblockchain for big data:Approaches,opportunities,and future directions[J].Future Generation Computer Systems,2022,131(C):209-226.
[3]TENG K D,ZHAO Q,TAN H R,et al.Emotional EEG recognition based on SVM-KNN algorithm [J].Computer System Application,2022,31(2):298-304.
[4]GROTHER P J,CANDELA G T,BLUE J L.Fast implemen-tations of nearest neighbor classifiers[J].Pattern Recognition,1997,30(3):459-465.
[5]OUGIARIGLOU S,DIAMANTARAS K I,EVANGELIDIS G.Exploring the effect of data reduction on neural network and support vector machine classification[J].Neurocomputing,2018,280:101-110.
[6]CHARBUTY B,ABDULAZEEZ A.Classification based on decision tree algorithm for machine learning[J].Journal of Applied Science and Technology Trends,2021,2(1):20-28.
[7]LEE C S,CHEANG P Y S,MOSLEHPOUR M.Predictive analytics in business analytics:Decision tree[J].Advances in Decision Sciences,2022,26(1):1-29.
[8]CORTES C,VAPNIK V.Support vector machine[J].Machine Learning,1995,20(3):273-297.
[9]BALASUBRAMANIAM V.Artificial intelligence algorithmwith SVM classification using dermascopic images for melanoma diagnosis[J].Journal of Artificial Intelligence and Capsule Networks,2021,3(1):34-42.
[10]NAEEM M,JAMAL T,DIAZ-MARTINEZ J,et al.Trends and futureperspective challenges in big data[M]//Advances in intelligent data analysis and applications.Singapore:Springer,2022:309-325.
[11]XU X,LIANG T,ZHU J,et al.Review of classical dimensionality reduction and sample selection methods for large-scale data processing[J].Neurocomputing,2019,328:5-15.
[12]ZHOU Z H.Machine learning[M].Singapore:Springer Nature,2021:2-17.
[13]YANG J F,QIAO P R,LI Y M,et al.A review of research on machine learning classification problems and algorithms [J].Statistics and Decision,2019,35(6):36-40.
[14]VAPNIK V,IZMAILOV R.Reinforced SVM method and memorization mechanisms[J].Pattern Recognition,2021,119:108018.
[15]SZELISKI R.Computer vision:Algorithms and applications[M].Berlin:Springer Nature,2022:273-300.
[16]MOCHIDA K,KODA S,INOUE K,et al.Computer vision-based phenotyping for improvement of plant productivity:A machine learning perspective[J].GigaScience,2018,8(1):1-12.
[17]DHAR A,MUKHERJEE H,DASH N S,et al.Text categorization:past and present[J].Artificial Intelligence Review,2021,54(4):3007-3054.
[18]LUO X.Efficient English text classification using selected machine learning techniques[J].Alexandria Engineering Journal,2021,60(3):3401-3409.
[19]MOHAMMADI M,RASHID T A,KARIM S H T,et al.Acomprehensive survey and taxonomyof the SVM-based intrusion detection systems[J].Journal of Network and Computer Applications,2021,178:102983.
[20]ZHANG X,LI C,WANG X,et al.A novel fault diagnosis procedure based on improved symplectic geometry mode decomposition and optimized SVM[J].Measurement,2021,173:108644.
[21]WANG M,CHEN Y,ZHANG X,et al.Roller bearing fault diagnosis based on integrated fault feature and SVM[J].Journal of Vibration Engineering & Technologies,2022,10(3):853-862.
[22]ALI W,TIAN W,DIN S U,et al.Classical and modern face re-cognition approaches:A complete review[J].Multimedia Tools and Applications,2021,80(3):4825-4880.
[23]SALAMH A B S,AKYüZ H I.A new deep learning model for face recognition and registration in distance learning[J].International Journal of Emerging Technologies in Learning,2022,17(12):29.
[24]ALWAJIDI S,YANG L.Multiresolution hierarchical supportvector machine for classification of large datasets[J].Know-ledge and Information Systems,2022,64(12):1-16.
[25]GOYAL S.Effectivesoftware defect prediction using supportvector machines[J].International Journal of System Assurance Engineering and Management,2022,13(2):681-696.
[26]TANVEER M,RAJANI T,RASTOGI R,et al.Comprehensive review on twin support vector machines[J].arXiv:2105.00336,2022.
[27]DASH R,DASH D K,PANDA R S.Linguistic information for decision-making using SVM[M]//Advances in data science and management.Singapore:Springer,2022:3-10.
[28]WANG T H,CHEN J T.A review of research on the selection of kernel functions[J].Computer Engineering and Design,2012,33(3):1181-1186.
[29]ALSHUDUKHI J S.Smart and interactive healthcare systembased on speech recognition using soft margin formulation and kernel trick[J].International Journal of System Assurance Engineering and Management,2022,15:324-333.
[30]WANG Q.Support vector machine algorithm inmachine learning[C]//2022 IEEE International Conference on Artificial Intelligence and Computer Applications.IEEE,2022:750-756.
[31]DIVYANTH L G,CHELLADURAI V,LOGANATHAN M,et al.Identification of green gram(vigna radiata) grains infested by callosobruchus maculatus through X-ray imaging and GAN-based image augmentation[J].Journal of Biosystems Enginee-ring,2022,47(3):302-317.
[32]NAGPAL M,KAUSHAL M,SHARMA A.A feature reduced intrusion detection system with optimized SVM using big bang big crunch optimization[J].Wireless Personal Communications,2022,122(2):1939-1965.
[33]HASSANAT A B,ALI H N,TARAWNEHA S,et al.Magneticforce classifier:A novel method for big data classification[J].IEEE Access,2022,10:12592-12606.
[34]GAYE B,ZHANG D,WULAMU A.Improvement of support vector machine algorithm in big data background[J].Mathema-tical Problems in Engineering,2021,2021:1-9.
[35]ALI A A A,MALLAIAH S.Intelligent handwritten recognition using hybrid CNN architectures based-SVM classifier with dropout[J].Journal of King Saud University-Computerand Information Sciences,2022,34(6):3294-3300.
[36]KUDO T,MATSUMOTO Y.Chunking with support vectormachines[C]//Proceedings of the 2nd Conference of the North American Chapter of the Association for Computational Linguistics on Language technologies.Association for Computational Linguistics,2001:1-8.
[37]SINGH K R,NEETHU K P,MADHUREKAA K,et al.ParallelSVM model for forest fire prediction[J].Soft Computing Letters,2021,3:100014.
[38]BADR E,ALMOTAIRI S,SALAM M A,et al.New sequentialand parallel supportvector machine with grey wolf optimizer for breast cancer diagnosis[J].Alexandria Engineering Journal,2022,61(3):2520-2534.
[39]DESHPANDE N J,KARIBASAPPA K G,TOTAD S G.Comparative analysis of optimization techniques on multi-classSVM[C]//International Conference for Emerging Technology.IEEE,2022:1-6.
[40]QIN Y,DING S F.A review of semi-supervised clustering[J].Computer Science,2019,46(9):15-21.
[41]HUANG H,XIONG W,WU K et al.K-means hybrid iterative clustering based on memory transfer flagfish optimization [J].Journal of Shanghai Jiaotong University,2022,56(12):1638.
[42]YAO Y,LIU Y,YU Y,et al.K-SVM:An effective SVM algo-rithm based on k-means clustering[J].Journal of Computers,2013,8(10):2632-2639.
[43]ALMEIDA M,BRAGA A,BRAGA J P.SVM-KM:SpeedingSVMs learning with a priori cluster selection and k-means[C]//Proceedings of the 6th Brazilian Symposium on Neural Networks.IEEE,2000:162-167.
[44]BANG S,JHUN M.Weightedsupport vector machine using k-means clustering[J].Communications in Statistics-Simulation and Computation,2014,43(10):2307-2324.
[45]KOGGALAGE R,HALGAMUGE S.Reducing the number oftraining samples for fast support vectormachine classification[J].Neural Information Processing-Letters and Reviews,2004,2(3):57-65.
[46]SHEN X J,MU L,LI Z,et al.Large-scale support vector ma-chine classification with redundant data reduction[J].Neurocomputing,2016,172:189-197.
[47]JAVADI S,HASHEMY SHAHDANY S M,NESHAT A,et al.Multi-parameter risk mapping of qazvin aquifer by classic and fuzzy clustering techniques[J].Geocarto International,2022,37(4):1160-1182.
[48]CERVANTES J,LI X,YU W.Support vector machine classification based on fuzzy clustering for large datasets[C]//Mexican International Conference on Artificial Intelligence.Springer,2006:572-582.
[49]MANIMALA K,DAVID I G,SELVI K.A novel dataselection technique using fuzzy c-means clustering to enhance SVM-based power quality classification[J].Soft Computing,2015,19(11):3123-3144.
[50]ALMASI O N,ROUHANI M.Fast and de-noise support vector machine training method based on fuzzy clustering method for large real world datasets[J].Turkish Journal of Electrical Engineering and Computer Sciences,2016,24(1):219-233.
[51]WANG L,SUI M,LI Q,et al.A new method of sample reduction for support vector classification[C]//Asia-Pacific Services Computing Conference.IEEE,2012:301-304.
[52]KRISHNA D P,SENGUTTUVAN A,LATHA T S.Clustering on large numeric data sets using hierarchical approach:Birch[J].Global Journal of Computer Science and Technology,2012,12(C12):29-32.
[53]YU H,YANG J,HAN J,et al.Making SVMs scalable to large data sets using hierarchical cluster indexing[J].Data Mining and Knowledge Discovery,2005,11(3):295-321.
[54]CHU Z,WANG W,LI B,et al.An operation health status monitoring algorithm of special transformers based on birch and Gaussian cloud methods[J].Energy Reports,2021,7:253-260.
[55]LANG A,SCHUBERT E.Betula:Numerically stable CF-trees forbirch clustering[C]//International Conference on Similarity Search and Applications.Springer,2020:281-296.
[56]ALZU’BI A,BARHAM M.Automatic birch thresholding with features transformation for hierarchical breast cancer clustering[J].International Journal of Electrical and Computer Enginee-ring,2022,12(2):1498-1507.
[57]AWAD M,KHAN L,BASTANI F,et al.An effective support vector machines performance using hierarchical clustering[C]//Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence.IEEE,2004:663-667.
[58]LUO F,KHAN L,BASTANI F,et al.A dynamically growingself-organizing tree(DGSOT) for hierarchical clustering gene expression profiles[J].Bioinformatics,2004,20(16):2605-2617.
[59]HE D J,WU X R,YU L.Research ondensity clustering methods[J].Communications Technology,2022,55(2):135-142.
[60]WU F F,ZHAO Y L,JIANG Z F.Support vector machine classification algorithm based ondensity clustering[J].Academic Journal of Xi’an Jiaotong University,2005,39(12):1319-1322.
[61]ZHANG H,ZOU K Q,CUI J,et al.An improved fuzzy support vector machine based on density clustering[J].Computer Engineering,2009,35(5):194-196.
[62]AKASAPU A K,RAO P S,SHARMA L K,et al.Density based k-nearest neighbors clustering algorithm for trajectory data[J].International Journal of Advanced Science and Technology,2011,31(1):47-57.
[63]ZHAO F J,HE X S,WANG J.An improved support vector machine based on density clustering[J].Journal of Jiamusi University:Natural Science Edition,2010,28(4):587-589.
[64]SCHIILKOP P B,BURGEST C,VAPNIK V.Extracting support data for a given task[C]//Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining.AAAI Press,1995:252-257.
[65]SHI P,KUANG L,TANG Y,et al.Pond water quality data stream anomaly detection based on improved SVDD algorithm[J].Journal of Agricultural Engineering,2021,37(24):249-256.
[66]ZHANG Y,CHI Z X,XIE F D.Improved PCM-based support vector description of multiclass classifier[J].Computer Science,2008(8):149-153.
[67]CERVANTES J,LI X,YU W,et al.Supportvector machineclassification for large data sets via minimum enclosing ball clustering[J].Neurocomputing,2008,71(4/5/6):611-619.
[68]TSANG I W,KWOK J T,CHEUNG P M,et al.Core vector machines:Fast SVM training on very large data sets[J].Journal of Machine Learning Research,2005,6(4):363-392.
[69]TSANG I W H,KWOK J T Y,ZURADA J M.Generalized core vector machines[J].IEEE Transactions on Neural Networks,2006,17(5):1126-1140.
[70]WANG D,ZHANG B,ZHANG P,et al.An online core vector machine with adaptiveMEB adjustment[J].Pattern Recognition,2010,43(10):3468-3482.
[71]LI D H,DU S X,WU T J.A fast incremental learning algorithm for linear support vector machines based on shell vectors [J].Journal of Zhejiang University:Engineering Science,2006,40(2):203-207.
[72]BENNETT K P,BREDENSTEINER E J.Duality and geometry in SVM classifiers[C]//Proceedings of the International Confe-rence on Machine Learning.Citeseer,2000:57-64.
[73]GOODRICH B,ALBRECHT D,TISCHER P.Algorithms forthe computation of reduced convex hulls[C]//Proceedings of the Australasian Joint Conference on Artificial Intelligence.Springer,2009:230-239.
[74]CRISP D,BURGES C J.A geometric interpretation of v-SVM classifiers[J].Advances in Neural Information Processing Systems,1999,12:244-250.
[75]MAVROFORAKIS M E,SDRALIS M,THEODORIDIS SA.Geometric nearest point algorithm for the efficient solution of the SVM classification task[J].IEEE Transactions on Neural Networks,2007,18(5):1545-1549.
[76]MAVROFORAKIS M E,SDRALIS M,THEODORIDIS S.A novel SVM geometric algorithm based on reduced convex hulls[C]//Proceedings of the 18th International Conference on Pattern Recognition.IEEE,2006:564-568.
[77]MAVROFORAKIS M E,THEODORIDIS S.A geometric approach to support vector machine classification[J].IEEE Tran-sactions on Neural Networks,2006,17(3):671-682.
[78]LIU Z,LIU J G,PAN C,et al.A novel geometric approach to binary classification based on scaled convex hulls[J].IEEE Transactions on Neural Networks,2009,20(7):1215-1220.
[79]CHAU A L,LI X,YU W.Large data sets classification using convex-concave hull and support vector machine[J].Soft Computing,2013,17(5):793-804.
[80]LOPEZ-CHAU A,LI X,YU W.Convex-concave hullfor classification with support vector machine[C]//Proceedings of the 12th International Conference on Data Mining Workshops.IEEE,2012:431-438.
[81]BIRZHANDI P,YOUN H Y.CBCH(clustering-based convexhull) for reducing training time of support vector machine[J].The Journal of Supercomputing,2019,75(8):5261-5279.
[82]BARBER C B,DOBKIN D P,HUHDANPAA H.The quickhull algorithm for convex hulls[J].Association for Computing Machinery,1996,22(4):469-483.
[83]WANG D,QIAO H,ZHAN B,et al.Online support vectormachine based on convex hull vertices selection[J].IEEE Transactions on Neural Networks and Learning Systems,2013,24(4):593-609.
[84]FRANC V,HLAVÁČ V.An iterative algorithm learning themaximal margin classifier[J].Pattern Recognition,2003,36(9):1985-1996.
[85]KHOSRAVANI H R,RUANO A E,FERREIRA P M.A convex hull-based data selection method for data driven models[J].Applied Soft Computing,2016,47:515-533.
[86]OSUNA E,CASTRO O D.Convex hull in feature space for support vector machines[C]//Ibero-American Conference on Artificial Intelligence.Springer,2002:411-419.
[87]SYED N A,LIU H,SUNG K K.Incremental learning with support vector machines[C]//Proceedings of the 2001 IEEE International Conference on Data Mining.IEEE,2001:641-642.
[88]XU J,XU C,ZOU B,et al.New incrementallearning algorithm with support vector machines[J].IEEE Transactions on Systems,Man,and Cybernetics:Systems,2018,49(11):2230-2241.
[89]MITRA P,MURTHY C A,PAL S K.Data condensation inlarge databases by incremental learning with support vector machines[C]//Proceedings of the 15th International Conference on Pattern Recognition.IEEE,2000:708-711.
[90]CAUWENBERGHS G,POGGIO T.Incremental and decremental support vector machine learning[C]//Proceedings of the 13th International Conference on Neural Information Processing Systems.MIT Press,2000:388-394.
[91]KATAGIRI S,ABE S.Incremental training of support vectormachines using hyperspheres[J].Pattern Recognition Letters,2006,27(13):1495-1507.
[92]LI C,LIU K,WANG H.The incremental learning algorithmwith support vector machine based on hyperplane-distance[J].Applied Intelligence,2011,34(1):19-27.
[93]WU C M,WANG X D,BAI D Y,et al.Fast SVM incremental learning algorithm based on class boundary hull vectors[J].Computer Engineering and Applications,2010,46(23):185-187.
[94]WEN B,SHAN G L,DUAN X S.Research on incrementallearning algorithm based on KKT condition and hull vector [J].Computer Science,2013,40(3):4.
[95]REN P,XIAO Y,CHANG X,et al.A survey ofdeep activelearning[J].ACM Computing Surveys,2021,54(9):1-40.
[96]DEMIR B,PERSELLO C,BRUZZONE L.Batch-mode active-learning methods for the interactive classification of remote sen-sing images[J].IEEE Transactions on Geoscience and Remote Sensing,2010,49(3):1014-1031.
[97]SASSANO M.An empirical study of active learning with support vector machines for Japanese word segmentation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.Association for Natural Language Processing.2002:505-512.
[98]SCHOHN G,COHN D.Less is more:active learning with support vector machines[C]//Proceedings of the 17th International Conference on Machine Learning.Morgan Kaufmann Publishers Inc.2000:839-846.
[99]TONG S,KOLLER D.Support vector machine active learning with applications to text classification[J].Journal of Machine Learning Research,2002,2(1):999-1006.
[100]XU Z,YU K,TRESP V,et al.Representative sampling for text classification using support vector machines[C]//European Conference on Information Retrieval.Springer,2003:393-407.
[101]LI M,SETHI I K.Confidence-based active learning[J].IEEETransactions on Pattern Analysis and Machine Intelligence,2006,28(8):1251-1261.
[102]XU H,LI L,GUO P,et al.Uncertainty SVM active learning algorithm based on convex hull and sample distance[C]//Proceedings of the 33th Chinese Control and Decision Conference.IEEE,2021:6815-6822.
[103]LENG Y,QI G H,XU X Y,et al.A new SVM active learning algorithm based on KNN[C]//Advanced Materials Research.Trans Tech Publications Ltd.,2014:2906-2909.
[104]XU H,BIE X,FENG H,et al.Multiclass SVM active learning algorithm based on decision directed acyclic graph and one versus one[J].Cluster Computing,2019,22(3):6241-6251.
[105]GOUDJIL M,KOUDIL M,BEDDA M,et al.A novel activelearning method using SVM for text classification[J].International Journal of Automation and Computing,2018,15(3):290-298.
[106]LEE Y J,MANGASARIAN O L.RSVM:Reduced support vector machines[C]//Proceedings of the 2001 SIAM International Conference on Data Mining.Society for Industrial and Applied Mathematics,2001:1-17.
[107]CHANG C C,LEE Y J.Generating the reduced set by systema-tic sampling[C]//International Conference on Intelligent Data Engineering and Automated Learning.Springer,2004:720-725.
[108]CHIEN L J,CHANG C C,LEE Y J.Variant methods of reduced set selection for reduced support vector machines[J].Journal of Information Science and Engineering,2010,26(1):183-196.
[109]BALCÁZAR J,DAI Y,WATANABE O.A random samplingtechnique for training support vectormachines[C]//Interna-tional Conference on Algorithmic Learning Theory.Springer,2001:119-134.
[110]KAWULOK M,NALEPA J.Support vector machines training data selection using a genetic algorithm[C]//Proceedings of the 2012 Joint IAPR International Conference on Structural,Syntactic,and Statistical Pattern Recognition.2012:557-565.
[111]NALEPA J,KAWULOK M.Adaptive Genetic Algorithm to Select Training Data for Support Vector Machines[C]//Procee-dings of the 17th European Conference onApplications of Evolutionary Computation.Springer,2014:514-525.
[112]SHARIF I,CHAUDHURI D.A multiseed-based SVM classification technique for training sample reduction[J].Turkish Journal of Electrical Engineering and Computer Sciences,2019,27(1):595-604.
[113]SUÁREZ C A L,QUINTANA D,CERVANTES A.A surveyon machine learning for recurring concept drifting data streams[J].Expert Systems with Applications,2022,213(2):118934.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!