Computer Science ›› 2017, Vol. 44 ›› Issue (12): 33-37. doi: 10.11896/j.issn.1002-137X.2017.12.006

• 4th CCF Big Data Conference •

Study of ELM Algorithm Parallelization Based on Spark

LIU Peng, WANG Xue-kui, HUANG Yi-hua, MENG Lei and DING En-jie

  1. Internet of Things (Perception Mine) Research Center, China University of Mining and Technology, Xuzhou 221008, China;
  2. National and Local Joint Engineering Laboratory of Internet Application Technology on Mine, Xuzhou 221008, China;
  3. School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China;
  4. PASA Big Data Laboratory, Department of Computer Science, Nanjing University, Nanjing 210023, China
  • Online: 2018-12-01 Published: 2018-12-01
  • Funding:
    This work was supported by the National Key R&D Program of China: Key Technologies and Equipment R&D for the Internet of Things in Mine Safety Production (2017YFC0804400, 2017YFC0804401) and the National Natural Science Foundation of China (61471361,3)


Abstract: The extreme learning machine (ELM) trains quickly, but its training involves a large number of matrix operations, so it remains inefficient on massive amounts of data. After a thorough study of the parallel computation mechanism of Spark resilient distributed datasets (RDDs), we designed a parallel computation scheme for the core matrix-multiplication step, and then designed and implemented a Spark-based parallel ELM algorithm. For performance comparison, a Hadoop-MapReduce-based parallel ELM was also implemented. Experimental results show that the Spark-based parallel ELM runs significantly faster than the Hadoop MapReduce version, and the larger the amount of data processed, the more pronounced Spark's efficiency advantage becomes.
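The ELM training procedure being parallelized reduces to matrix algebra: input weights and biases of a single-hidden-layer network are drawn at random, the hidden-layer output matrix H is computed, and the output weights follow as beta = H⁺T via the Moore-Penrose pseudo-inverse. A minimal single-machine sketch in NumPy, for orientation only (function names, the sigmoid activation, and all shapes are illustrative choices, not the authors' code):

```python
import numpy as np

def elm_train(X, T, n_hidden, seed=0):
    """Train a basic ELM: random hidden layer + least-squares output weights.

    X: (n_samples, n_features) inputs; T: (n_samples, n_outputs) targets.
    Returns (W, b, beta) so predictions are sigmoid(X @ W + b) @ beta.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # random input weights
    b = rng.standard_normal(n_hidden)                # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))           # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T                     # beta = H^+ T (Moore-Penrose)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Tiny usage example: with enough hidden units, the 4 XOR-like
# training points are fit (almost) exactly by the least-squares solution.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])
W, b, beta = elm_train(X, T, n_hidden=20)
pred = elm_predict(X, W, b, beta)
```

The only trained quantity is beta; everything else is random projection, which is why ELM training cost is dominated by the matrix products and pseudo-inverse that the paper distributes.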

Key words: ELM, parallelization, Spark, RDD, Hadoop, MapReduce
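The core step the abstract singles out is parallel matrix multiplication over distributed data. The block-partitioned map/reduce scheme such systems use can be illustrated in plain Python without a Spark dependency (block size, key layout, and function names are our own illustrative choices; dicts and loops stand in for Spark's join/reduceByKey over keyed RDD partitions): each block A[i,k] is paired with each block B[k,j], the partial product is emitted under key (i,j), and a reduce step sums the partials per output block.

```python
import numpy as np
from collections import defaultdict

def split_blocks(M, bs):
    """Partition matrix M into a dict {(row_block, col_block): sub-matrix}."""
    return {(i // bs, j // bs): M[i:i+bs, j:j+bs]
            for i in range(0, M.shape[0], bs)
            for j in range(0, M.shape[1], bs)}

def block_matmul(A, B, bs=2):
    """Compute A @ B from block partial products, map/reduce style."""
    a_blocks, b_blocks = split_blocks(A, bs), split_blocks(B, bs)
    # "map"/join step: emit partial product A[i,k] @ B[k,j] under key (i,j)
    partials = defaultdict(list)
    for (i, k), a in a_blocks.items():
        for (k2, j), b in b_blocks.items():
            if k == k2:
                partials[(i, j)].append(a @ b)
    # "reduceByKey" step: sum the partial products for each output block
    out_blocks = {key: sum(vals[1:], vals[0]) for key, vals in partials.items()}
    # reassemble the dense result (a driver-side collect in a real cluster)
    C = np.zeros((A.shape[0], B.shape[1]))
    for (i, j), block in out_blocks.items():
        C[i*bs:i*bs+block.shape[0], j*bs:j*bs+block.shape[1]] = block
    return C

# Usage: the blocked product matches the direct product.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 4))
B = rng.standard_normal((4, 3))
C = block_matmul(A, B, bs=2)
```

In a real Spark implementation each block lives in an RDD partition, so the pairing and summation happen in parallel across executors and only block-sized pieces ever move over the network.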

