Parallel PSO-kmeans Algorithm Implementing Web Log Mining Based on Hadoop

Computer Science ›› 2015, Vol. 42 ›› Issue (Z6): 470-473.

MA Han-da, HAO Xiao-yu, MA Ren-qing

School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013 (all three authors)

Online: 2018-11-14 Published: 2018-11-14

Abstract

With the rapid development of Internet technology, Web log mining on a single node has become very difficult, and the Hadoop cloud platform offers a new solution to this problem. However, k-means, the traditional clustering algorithm for Web log mining, has shortcomings such as sensitivity to the choice of initial cluster centers, which tends to reduce clustering accuracy. To address this, the paper proposes a k-means algorithm based on particle swarm optimization (PSO) that frees k-means from the influence of the initial cluster centers, and implements the algorithm with MapReduce on the Hadoop platform. Experimental results show that the proposed algorithm achieves higher clustering accuracy than traditional k-means and runs considerably more efficiently than the serial stand-alone algorithm.

Key words: Hadoop, k-means, PSO, MapReduce, Web log mining
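
The idea summarized in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the function names, the PSO parameters (`w`, `c1`, `c2`), the swarm size, and the use of the sum of squared errors (SSE) as the fitness function are all assumptions made for illustration. PSO first searches for a good set of cluster centers, and k-means then refines them. The k-means step is written as a map function and a reduce function only to suggest how one iteration might be split across a MapReduce job; no Hadoop API is used here.

```python
import random
from collections import defaultdict

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda j: dist2(point, centers[j]))

def sse(points, centers):
    """Fitness (assumed): sum of squared distances to the nearest center."""
    return sum(dist2(p, centers[nearest(p, centers)]) for p in points)

def kmeans_map(point, centers):
    """Map: emit (cluster index, (point, 1)) for one input record."""
    return nearest(point, centers), (point, 1)

def kmeans_reduce(values):
    """Reduce: average all points that were assigned to one cluster."""
    dim = len(values[0][0])
    total, count = [0.0] * dim, 0
    for p, n in values:
        for d in range(dim):
            total[d] += p[d]
        count += n
    return [t / count for t in total]

def kmeans_step(points, centers):
    """One k-means iteration expressed as map, group by key, reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centers)
        groups[key].append(value)
    # An empty cluster keeps its previous center.
    return [kmeans_reduce(groups[j]) if j in groups else list(centers[j])
            for j in range(len(centers))]

def pso_centers(points, k, n_particles=10, iters=30,
                w=0.7, c1=1.5, c2=1.5, seed=0):
    """PSO over candidate center sets; each particle holds k centers."""
    rng = random.Random(seed)
    dim = len(points[0])
    pos = [[list(p) for p in rng.sample(points, k)] for _ in range(n_particles)]
    vel = [[[0.0] * dim for _ in range(k)] for _ in range(n_particles)]
    best_pos = [[list(c) for c in particle] for particle in pos]
    best_fit = [sse(points, particle) for particle in pos]
    g = min(range(n_particles), key=lambda i: best_fit[i])
    gbest, gfit = [list(c) for c in best_pos[g]], best_fit[g]
    for _ in range(iters):
        for i in range(n_particles):
            for j in range(k):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    # Standard velocity update: inertia + personal + global pull.
                    vel[i][j][d] = (w * vel[i][j][d]
                                    + c1 * r1 * (best_pos[i][j][d] - pos[i][j][d])
                                    + c2 * r2 * (gbest[j][d] - pos[i][j][d]))
                    pos[i][j][d] += vel[i][j][d]
            fit = sse(points, pos[i])
            if fit < best_fit[i]:
                best_fit[i] = fit
                best_pos[i] = [list(c) for c in pos[i]]
                if fit < gfit:
                    gfit, gbest = fit, [list(c) for c in pos[i]]
    return gbest

def pso_kmeans(points, k, kmeans_iters=20, seed=0):
    """PSO chooses the starting centers; k-means then refines them."""
    centers = pso_centers(points, k, seed=seed)
    for _ in range(kmeans_iters):
        new_centers = kmeans_step(points, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

Calling `pso_kmeans(points, k)` on a list of numeric tuples returns `k` centers. In a distributed setting, `kmeans_map` would presumably run on each input split and `kmeans_reduce` once per cluster key, but how the paper actually partitions the PSO and k-means work across jobs is not stated in the abstract.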

MA Han-da, HAO Xiao-yu, MA Ren-qing. Parallel PSO-kmeans Algorithm Implementing Web Log Mining Based on Hadoop[J]. Computer Science, 2015, 42(Z6): 470-473. https://doi.org/

