计算机科学 ›› 2019, Vol. 46 ›› Issue (11A): 204-207.

• 数据科学 • 上一篇    下一篇

基于MapReduce的强连通网格聚类算法

胡赢双, 陆亿红   

  1. (浙江工业大学计算机科学与技术学院 杭州310014)
  • 出版日期:2019-11-10 发布日期:2019-11-20
  • 通讯作者: 陆亿红(1968-),女,硕士,副教授,主要研究方向为软件理论、数据挖掘等,E-mail:lyh@zjut.edu。
  • 作者简介:胡赢双(1995-),女,硕士生,主要研究方向为数据挖掘,E-mail:huyingshuang2@163.com。
  • 基金资助:
    本文受浙江省基础公益研究计划项目(GG19E090005)资助。

Cell Clustering Algorithm Based on MapReduce and Strongly Connected Fusion

HU Ying-shuang, LU Yi-hong   

  1. (College of Computer Science and Technology,Zhejiang University of Technology,Hangzhou 310014,China)
  • Online:2019-11-10 Published:2019-11-20

摘要: 随着位置大数据的爆炸式增长,传统的串行算法已无法对其进行高效地聚类处理,因此,基于MapReduce框架的并行聚类算法研究逐渐成为热点。聚类算法并行化后的聚类质量通常难以保证,因此对并行化聚类结果进行归约的方法极为重要。首先提出基于网格的改进DBSCAN并行化聚类算法,通过该步骤得到每个数据子集的聚类结果。然后在分析网格与簇的关系,定义网格簇和网格簇的连通、强连通概念的基础上,通过计算网格簇之间的连通权值矩阵,对具有强连通关系的网格簇进行归约,构成基于MapReduce的强连通网格聚类算法。该算法可实现位置大数据集的高效聚类。实验分析表明,基于MapReduce的强连通网格聚类算法对位置大数据的处理具有较高的效率和聚类质量。

关键词: DBSCAN, MapReduce, 强连通, 网格, 位置大数据

Abstract: With the explosive growth of large location data,most of the traditional serial clustering algorithms can not process big data efficiently.In order to solve this problem,more and more people are studying parallel clustering algorithm.It is difficult to guarantee the clustering quality of parallel clustering algorithm,so it is important to study the algorithm of reducing the result of parallel clustering.Therefore,a grid clustering algorithm based on strongly connected fusion was proposed.Firstly,clustering result of data subsets is obtained according to the improved DBSCAN algorithm based on MapReduce.Next,the relationship between grid and cluster is analyzed and the concepts of Gird-cluster,connectivity and strong connectivity of Gird-clusters are defined.Then the connectivity weights matrix between Gird-cluster and Gird-cluster is calculated.Finally,whether to reduce two Gird-clusters or not is decided according to connectivity weight.The experimental results show that the proposed algorithm has high efficiency and high clustering quality in processing large location data.

Key words: Big data of position, DBSCAN, Gird, MapReduce, Strongly connected fusion

中图分类号: 

  • TP274
[1]刘经南,方媛,郭迟,等.位置大数据的分析处理研究进展[J].武汉大学学报(信息科学版),2014,39(4):379-385.
[2]YUAN J,ZHENG Y,XIE X,et al.T-Drive:Enhancing driving directions with taxi drivers’ intelligence[J].IEEE Trans.on Knowledge & Data Engineering,2013,25(1):220-232.
[3]ZHENG Y,XIE X,MA W Y.GeoLife:A collaborative social networking service among user,location and trajectory[J].Bulletin of the TechnicalCommittee on Data Engineering,2010,33(2):32-39.
[4]YUAN J,ZHENG Y,XIE X.Discovering regions of differentfunctions in a city using humanmobility and POIs[C]∥Know-ledge Discovery and Data Mining.ACM Press,2012:186-194.
[5]郭迟,刘经南,方媛,等.位置大数据的价值提取与协同挖掘方法[J].软件学报,2014,25(4):713-730.
[6]林乐轩.基于位置大数据的行人路径预测及人群密度预估系统研究[D].北京:北京邮电大学,2018.
[7]TOBLER W,DEICHMANN U,GOTTSEGEN J,et al.World population in a grid of spherical quadrilaterals[J].International Journal of Population Geography,1997,3(3):203-225.
[8]李斯凡.基于无监督学习技术的位置大数据分析[D].杭州:浙江理工大学,2017.
[9]GUTTMAN A.R-trees:A dynamic index structure for spatial searching[C]∥International Conference on Management of Data.Boston:1984:47-57.
[10]ZHAO Q,SHI Y,LIU Q,et al.A grid-growing clustering algorithm for geospatial Data[J].Pattern Recognition Letters,2014,53(53):77-84.
[11]KUMAR K M,REDDY A R M.A fast DBSC-AN clustering algorithm by accelerating neighbor searching using Groups me-thod[J].Pattern Recognition,2016,58:39-48.
[12]HE Y,TAN H,LUO W,et al.MR-DBSCAN:An efficient parallel density-based clustering algorithm using MapReduce[C]∥2011 IEEE 17th International Conference on Parallel and Distributed Systems.IEEE Computer Society,2011:473-480.
[13]KIM Y,SHIM K,KIM M S,et al.DBCURE-MR:An efficient density-based clustering algorithm for large data using MapReduce[J].Information Systems,2014,42(2):15-35.
[14]于彦伟,贾召飞,曹磊,等.面向位置大数据的快速密度聚类算法[J].软件学报,2018,29(8):2470-2484.
[15]ESTER M,KRIEGEL H P,SANDER J,et al.A density-based algorithm for discovering clusters in large spatial databases with noise[C]∥International Conference on Knowledge Discovery & Data Mining.Portland:AAAI Press,1996:226-231.
[16]黄德才.数据仓库与数据挖掘教程[M].北京:清华大学出版社,2016.
[17]钱潮恺,黄德才.基于维度频率相异度和强连通融合的混合数据聚类算法[J].模式识别与人工智能,2016,29(1):82-89.
[18]余长俊,张燃.云环境下基于Canopy聚类的FCM算法研究[J].计算机科学,2014,41(S1):316-319.
[19]GIONIS A,MANNILA H,TSAPARAS P.Clustering aggregation[J].ACM Transactions on Knowledge Discovery from Data (TKDD),2007,1(1):1-30.
[20]ZAHN C T.Graph-theoretical methods for detecting and de-scribing gestalt clusters[J].IEEE Transactions on Computers,1971,100(1):68-86.
[21]YUAN J,ZHENG Y,XIE X,et al.T-Drive:Enhancing driving directions with taxi drivers’ intelligence[J].IEEE Transactions on Knowledge & Data Engineering,2013,25(1):220-232.
[1] 程富豪, 徐泰华, 陈建军, 宋晶晶, 杨习贝.
基于顶点粒k步搜索和粗糙集的强连通分量挖掘算法
Strongly Connected Components Mining Algorithm Based on k-step Search of Vertex Granule and Rough Set Theory
计算机科学, 2022, 49(8): 97-107. https://doi.org/10.11896/jsjkx.210700202
[2] 刘卫明, 安冉, 毛伊敏.
基于聚类和WOA的并行支持向量机算法
Parallel Support Vector Machine Algorithm Based on Clustering and WOA
计算机科学, 2022, 49(7): 64-72. https://doi.org/10.11896/jsjkx.210500040
[3] 叶跃进, 李芳, 陈德训, 郭恒, 陈鑫.
基于国产众核架构的非结构网格分区块重构预处理算法研究
Study on Preprocessing Algorithm for Partition Reconnection of Unstructured-grid Based on Domestic Many-core Architecture
计算机科学, 2022, 49(6): 73-80. https://doi.org/10.11896/jsjkx.210900045
[4] 陈鑫, 李芳, 丁海昕, 孙唯哲, 刘鑫, 陈德训, 叶跃进, 何香.
面向国产异构众核架构的CFD非结构网格计算并行优化方法
Parallel Optimization Method of Unstructured-grid Computing in CFD for DomesticHeterogeneous Many-core Architecture
计算机科学, 2022, 49(6): 99-107. https://doi.org/10.11896/jsjkx.210400157
[5] 封雷, 朱登明, 李兆歆, 王兆其.
一种基于遮罩的稀疏点云滤波算法
Sparse Point Cloud Filtering Algorithm Based on Mask
计算机科学, 2022, 49(5): 25-32. https://doi.org/10.11896/jsjkx.210600129
[6] 刘江, 刘文博, 张矩.
OpenFoam中多面体网格生成的MPI+OpenMP混合并行方法
Hybrid MPI+OpenMP Parallel Method on Polyhedral Grid Generation in OpenFoam
计算机科学, 2022, 49(3): 3-10. https://doi.org/10.11896/jsjkx.210700060
[7] 张仁杰, 陈伟, 杭梦鑫, 吴礼发.
基于变分自编码器的不平衡样本异常流量检测
Detection of Abnormal Flow of Imbalanced Samples Based on Variational Autoencoder
计算机科学, 2021, 48(7): 62-69. https://doi.org/10.11896/jsjkx.200600022
[8] 张元鸣, 虞家睿, 蒋建波, 陆佳炜, 肖刚.
面向MapReduce的中间数据传输流水线优化机制
Intermediate Data Transmission Pipeline Optimization Mechanism for MapReduce Framework
计算机科学, 2021, 48(2): 41-46. https://doi.org/10.11896/jsjkx.191000103
[9] 郭杰, 高希然, 陈莉, 傅游, 刘颖.
用数据驱动的编程模型并行多重网格应用
Parallelizing Multigrid Application Using Data-driven Programming Model
计算机科学, 2020, 47(8): 32-40. https://doi.org/10.11896/jsjkx.200500093
[10] 程盛淦, 于浩然, 韦建文, 林新华.
基于定点压缩技术的双层粒子网格算法的设计与优化
Design and Optimization of Two-level Particle-mesh Algorithm Based on Fixed-point Compression
计算机科学, 2020, 47(8): 56-61. https://doi.org/10.11896/jsjkx.200200112
[11] 任帅, 王萌, 范傲雄, 高泽, 徐解, Shahzad KHURRAM, 张弢.
一种零高分辨率3D网格模型的信息隐藏算法
Zero-high-resolution Information Hiding Algorithm for 3D Mesh Model
计算机科学, 2020, 47(7): 328-334. https://doi.org/10.11896/jsjkx.190800021
[12] 罗晋楠, 张济民.
基于扩展Haar特征和DBSCAN的钢轨识别算法
Rail Area Extraction Using Extended Haar-like Features and DBSCAN Clustering
计算机科学, 2020, 47(6A): 153-156. https://doi.org/10.11896/JsJkx.200100008
[13] 庞荣来,林静,张磊.
网格驱动的双向图像拼接算法
Grid-driven Bi-directional Image Stitching Algorithm
计算机科学, 2020, 47(3): 130-136. https://doi.org/10.11896/jsjkx.190100239
[14] 郑磊, 吴俊威, 林俊勉, 潘翔.
交互标记约束的三维网格序列分割
Marker-constrained Interactive Segmentation of 3D Animated Meshes
计算机科学, 2020, 47(11A): 271-275. https://doi.org/10.11896/jsjkx.200400030
[15] 邓定胜.
一种改进的DBSCAN算法在Spark平台上的应用
Application of Improved DBSCAN Algorithm on Spark Platform
计算机科学, 2020, 47(11A): 425-429. https://doi.org/10.11896/jsjkx.190700071
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!