Computer Science ›› 2016, Vol. 43 ›› Issue (5): 238-242. doi: 10.11896/j.issn.1002-137X.2016.05.044

• Artificial Intelligence •

Research on Parallel SVM Algorithm Based on Spark

LIU Ze-shen, PAN Zhi-song

  1. College of Command Information Systems, PLA University of Science and Technology, Nanjing 210007
  • Published: 2018-12-01  Online: 2018-12-01
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61473149)

Abstract: As data volumes continue to grow, parallelizing the support vector machine (SVM) has become a research focus in data mining. To address the slow optimization and high memory consumption of SVM training on large-scale data, we propose a parallel SVM algorithm based on the Spark platform (SP-SVM). First, the method adjusts the merging strategy and training structure of the cascade SVM and implements the result on the Spark distributed computing framework. Second, it analyzes the performance of the parallel operators and optimizes the parallel implementation, effectively overcoming the low training efficiency of the cascade model. Experimental results show that, at the cost of a small loss in accuracy, the new parallel training method reduces training time to some extent and improves the learning efficiency of the model.

Key words: Parallel computing, Support vector machine, Large-scale data, Cascade model, Spark
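
For readers unfamiliar with the cascade model that the abstract builds on, the following is a minimal single-process sketch of a classic binary cascade: train an SVM on each data partition, keep only the points near the margin, merge those sets pairwise, and retrain until one set remains. It is not the SP-SVM algorithm itself, whose adjusted merging strategy and training structure are not described in the abstract. Everything in it is an assumption made for illustration: the function names, the linear kernel, the sub-gradient trainer (standing in for a real solver such as LIBSVM), the margin tolerance used to pick approximate support vectors, and the synthetic data. On Spark, the per-partition training step would presumably run inside a per-partition operator rather than a Python list comprehension.

```python
import random

def margin(model, point):
    """Functional margin y * (w.x + b) of one labeled point."""
    (w, b), (x, y) = model, point
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_linear_svm(data, lam=0.01, lr=0.1, epochs=300):
    """Full-batch sub-gradient descent on the L2-regularized hinge loss.
    A stand-in for a real SVM solver; hyperparameters are illustrative."""
    n, dim = len(data), len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        violators = [p for p in data if margin((w, b), p) < 1.0]
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for x, y in violators:
            for j in range(dim):
                grad_w[j] -= y * x[j] / n
            grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def support_vectors(data, model, tol=0.1):
    """Approximate support vectors: points on or inside the margin (within tol).
    Assumption: if a class has no such point, keep its closest point so that
    the next layer still sees both classes."""
    kept = [p for p in data if margin(model, p) <= 1.0 + tol]
    for label in (-1, 1):
        if not any(y == label for _, y in kept):
            same = [p for p in data if p[1] == label]
            if same:
                kept.append(min(same, key=lambda p: margin(model, p)))
    return kept

def cascade_svm(partitions):
    """Train per partition, then merge support-vector sets pairwise
    until a single set is left; return the final model and its training set."""
    sets = list(partitions)
    while True:
        models = [train_linear_svm(s) for s in sets]
        if len(sets) == 1:
            return models[0], sets[0]
        svs = [support_vectors(s, m) for s, m in zip(sets, models)]
        sets = [sum(svs[i:i + 2], []) for i in range(0, len(svs), 2)]

def make_data(n_per_class=100, seed=0):
    """Two well-separated 2-D Gaussian clusters (synthetic, for the demo only)."""
    rng = random.Random(seed)
    data = []
    for label in (-1, 1):
        for _ in range(n_per_class):
            x = (rng.gauss(2.0 * label, 0.5), rng.gauss(2.0 * label, 0.5))
            data.append((x, label))
    rng.shuffle(data)
    return data

data = make_data()
partitions = [data[i::4] for i in range(4)]  # four partitions, as a cluster might hold
model, final_set = cascade_svm(partitions)
accuracy = sum(margin(model, p) > 0 for p in data) / len(data)
print(f"final layer trained on {len(final_set)} of {len(data)} points, "
      f"accuracy {accuracy:.3f}")
```

The point of the structure is that each layer trains on far fewer points than the full data set, which is where the memory and time savings of a cascade come from; the cost is that points discarded early cannot influence the final model unless the cascade is fed back and repeated.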

