Research on Parallel SVM Algorithm Based on Spark

Computer Science, 2016, Vol. 43, Issue (5): 238-242. doi: 10.11896/j.issn.1002-137X.2016.05.044

LIU Ze-shen, PAN Zhi-song

College of Command Information Systems, PLA University of Science and Technology, Nanjing 210007

Online: 2018-12-01 Published: 2018-12-01
Funding:
This work was supported by the National Natural Science Foundation of China (61473149).

Abstract

As data volumes keep growing, the parallel design of the support vector machine (SVM) has become an active research topic in data mining. To address the slow optimization and high memory consumption of SVM training on large-scale data, this paper proposes SP-SVM, a parallel SVM algorithm based on the Spark platform. First, the method adjusts the merging strategy and training structure of the cascade SVM (Cascade SVM) and implements it on the Spark distributed computing framework. Second, it analyzes the performance of the parallel operators and optimizes the parallel implementation, which effectively overcomes the low training efficiency of the cascade model. Experimental results show that, at the cost of a small loss in accuracy, the new parallel training method reduces training time to some extent and improves the learning efficiency of the model.

Key words: Parallel computing, Support vector machine, Large scale data, Cascade model, Spark
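The abstract builds on the cascade SVM scheme, in which the data is split into partitions, an SVM is trained on each partition, and only the support vectors are passed on and merged pairwise until one set remains. The sketch below illustrates that unmodified cascade idea only. It is not the authors' SP-SVM, whose merging strategy and Spark operators the abstract does not detail. All function names (`train_linear_svm`, `fit_and_filter`, `cascade_svm`, `predict`) are hypothetical, the solver is a plain linear SVM trained by dual coordinate descent as a stand-in, and everything runs locally rather than on Spark.

```python
import random


def train_linear_svm(points, C=1.0, epochs=50, seed=0):
    """Linear L1-loss SVM trained by dual coordinate descent (stand-in solver).

    points: list of (x, y) with x a tuple of floats and y in {-1, +1}.
    The bias is folded in as a constant feature. Returns (w, alpha).
    """
    rng = random.Random(seed)
    data = [(tuple(x) + (1.0,), y) for x, y in points]
    w = [0.0] * len(data[0][0])
    alpha = [0.0] * len(data)
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = data[i]
            q = sum(v * v for v in x)  # > 0 because of the constant feature
            grad = y * sum(wj * xj for wj, xj in zip(w, x)) - 1.0
            new_alpha = min(max(alpha[i] - grad / q, 0.0), C)
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                w = [wj + delta * y * xj for wj, xj in zip(w, x)]
                alpha[i] = new_alpha
    return w, alpha


def fit_and_filter(points, C):
    """Train on one subset and keep only its support vectors (alpha > 0)."""
    w, alpha = train_linear_svm(points, C)
    return w, [p for p, a in zip(points, alpha) if a > 1e-8]


def cascade_svm(points, n_partitions=4, C=1.0):
    """Cascade training: per-partition SVMs, then pairwise merges of support vectors."""
    subsets = [points[i::n_partitions] for i in range(n_partitions)]
    results = [fit_and_filter(s, C) for s in subsets if s]
    while len(results) > 1:
        merged = [
            sum((sv for _, sv in results[i:i + 2]), [])
            for i in range(0, len(results), 2)
        ]
        results = [fit_and_filter(m, C) for m in merged]
    return results[0]  # (weights, surviving support vectors)


def predict(w, x):
    """Classify x with the learned weights (bias is the last weight)."""
    score = sum(wj * xj for wj, xj in zip(w, tuple(x) + (1.0,)))
    return 1 if score >= 0 else -1


# Toy, linearly separable data (illustrative only).
pos = [((1.0 + 0.1 * k, 1.0 + 0.05 * k), 1) for k in range(20)]
neg = [((-1.0 - 0.1 * k, -1.0 - 0.05 * k), -1) for k in range(20)]
w, svs = cascade_svm(pos + neg, n_partitions=4)
print(len(svs), "support vectors kept out of", len(pos + neg))
```

Each layer only sees the support vectors of the previous one, so the working set shrinks as the cascade proceeds. On Spark, the per-partition training step would presumably map onto a partition-level operator such as `mapPartitions`, but that mapping is an assumption here.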

LIU Ze-shen and PAN Zhi-song. Research on Parallel SVM Algorithm Based on Spark[J]. Computer Science, 2016, 43(5): 238-242. https://doi.org/10.11896/j.issn.1002-137X.2016.05.044

