计算机科学 ›› 2024, Vol. 51 ›› Issue (4): 11-18.doi: 10.11896/jsjkx.231100049

• 紧凑数据结构 • 上一篇    下一篇

基于MapReduce的大规模网络社区发现算法

王瀚橙, 戴海鹏, 陈志鹏, 陈树森, 陈贵海   

  1. 计算机软件新技术国家重点实验室(南京大学) 南京210023
  • 收稿日期:2023-11-06 修回日期:2024-01-05 出版日期:2024-04-15 发布日期:2024-04-10
  • 通讯作者: 戴海鹏(haipengdai@nju.edu.cn)
  • 作者简介:(hanchengwang@smail.nju.edu.cn)
  • 基金资助:
    国家自然科学基金(62272223,U22A2031,61872178)

Large-scale Network Community Detection Algorithm Based on MapReduce

WANG Hancheng, DAI Haipeng, CHEN Zhipeng, CHEN Shusen, CHEN Guihai   

  1. State Key Laboratory for Novel Software Technology,Nanjing University,Nanjing 210023,China
  • Received:2023-11-06 Revised:2024-01-05 Online:2024-04-15 Published:2024-04-10
  • Supported by:
    National Natural Science Foundation of China(62272223,U22A2031,61872178).

摘要: 社区发现是社会网络挖掘领域的基本问题。随着海量数据的迅速产生,传统社区发现算法愈发难以处理大规模社会网络。因此,针对大规模网络设计高效的社区发现算法意义重大。文中提出了一种基于MapReduce和k中心聚类的新型分布式算法。首先,该算法提出“朋友圈系数”技术,该技术可更加准确地度量结点间的距离。其次,该算法提出“两阶段k中心聚类”技术,该技术在选取中心点过程中融入结点中心度启发式信息,可显著优化输出结果的模块度。最后,该算法提出“以模块度为优化目标的社区融合”技术,该技术能够在无先验知识的前提下自动确定网络中的社区数目。实验结果表明,所提算法的社区发现结果模块度明显优于最先进的社区发现算法。例如,相比LPA算法,其将模块度平均提升9.19倍。

关键词: 社区发现, k中心聚类, 分布式计算, 数据挖掘, 大数据

Abstract: Community detection is a fundamental problem in the field of social network mining.With the rapid generation of massive data,traditional community detection algorithms are becoming increasingly difficult to handle large-scale social networks.Therefore,it is of great significance to design efficient community detection algorithms for large-scale networks.This paper proposes a new distributed algorithm based on MapReduce and k-center clustering.Firstly,the algorithm proposes the “friend circle coefficient” technique,which can measure the distance between nodes more accurately.Secondly,the algorithm proposes the “two-stage k-center clustering” technique,which incorporates node centrality heuristic information into the process of selecting center points and can significantly optimize the modularity of the results.Finally,the algorithm proposes a “community fusion method with modularity as the optimization goal” technique,which can automatically determine the number of communities in the network without prior knowledge.The evaluation results show that the proposed algorithm significantly outperforms the state-of-the-art community discovery algorithms in terms of modularity.For example,compared with the LPA algorithm,the proposed algorithm increases the modularity by an average of 9.19 times.

Key words: Community detection, k-center clustering, Distributed computing, Data mining, Big data

中图分类号: 

  • TP391
[1]NEWMAN M E J,GIRVAN M.Finding and evaluating community structure in networks[J].Physical Review E,2004,69(2):1-16.
[2]GAO Y,ZHANG H L.Survey on community detection method based on random walk[J].Journal on Communications,2023,44(6):198-210.
[3]SANGAIAH A K,REZAEI S,JAVADPOUR A,et al.Explai-nable AI in big data intelligence of community detection for digitalization e-healthcare services[J].Applied Soft Computing,2023,136(1):1-10.
[4]YANG H L,ZHAO X,CHEN C,et al.Community detection in social network based on node influence expansion[J].Journal of Harbin University of Science and Technology,2023,28(3):10-19.
[5]GHAREHCHOPOGH F S.An improved harris hawks optimization algorithm with multi-strategy for community detection in social network[J].Journal of Bionic Engineering,2023,20(3):1175-1197.
[6]NAIK D,RAMESH D,GOROJANAM N B.Enhanced link prediction using sentiment attribute and community detection[J].Journal of Ambient Intelligence and Humanized Computing,2023,14(4):4157-4174.
[7]MITTAL A,GOEL A.A review on community detection approach for recommender systems[C]//Proceedings of the International Conference on Communication and Electronics Systems.2023:1521-1526.
[8]AMIRA A,DERHAB A,KARBAB E B,et al.A survey of malware analysis using community detection algorithms[J].ACM Computing Surveys,2023,56(2):1-29.
[9]FORTUNATO S.Community detection in graphs[J].Physics Reports,2010,486(3):75-174.
[10]NEWMAN M E J.Fast algorithm for detecting communitystructure in networks[J].Physical Review E,2004,69(6):1-5.
[11]RAGHAVAN U N,ALBERT R,KUMARA S.Near linear time algorithm to detect community structures in large-scale networks[J].Physical Review E,2007,76(3):1-11.
[12]KERNIGHAN B W,LIN S.An efficient heuristic procedure for partitioning graphs[J].The BellSystem Technical Journal,1970,49(2):291-307.
[13]LUO Z G,DING F,JIANG X Z,et al.New progress on community detection in complex networks[J].Journal of National University of Defense Technology,2011,33(1):47-52.
[14]SUN H,LIU J,HUANG J,et al.LinkLPA:A link-based label propagation algorithm for overlapping community detection in networks[J].Computational Intelligence,2017,33(2):308-331.
[15]FLAKE G W,LAWRENCE S,GILES C L.Efficient identification of web communities[C]//Proceedings of the International Conference on Knowledge Discovery and Data Mining.2000:150-160.
[16]NEWMAN M E J.Modularity and community structure in networks[J].Proceedings of the National Academy of Sciences,2006,103(23):8577-8582.
[17]CLAUSET A,NEWMAN M E J,MOORE C.Finding community structure in very large networks[J].Physical Review E,2004,70(6):1-6.
[18]PIZZUTI C.Evolutionary computation for community detection in networks:a review[J].IEEE Transactions on Evolutionary Computation,2017,22(3):464-483.
[19]LIM S,KIM J,LEE J G.BlackHole:Robust community detection inspired by graph drawing[C]//Proceedings of the International Conference on Data Engineering.2016:25-36.
[20]SINGH A,GARG S,BATRA S,et al.Probabilistic data structure-based community detection and storage scheme in online social networks[J].Future Generation Computer Systems,2019,94(1):173-184.
[21]BLOOM B H.Space/time trade-offs in hash coding with allowable errors[J].Communications of the ACM,1970,13(7):422-426.
[22]BENDER M A,FARACH-COLTON M,JOHNSON R,et al.Don't thrash:How to cache your hash on flash[J].Proceedings of the VLDB Endowment,2012,5(11):1627-1637.
[23]SHIOKAWA H,FUJIWARA Y,ONIZUKA M.Fast algorithm for modularity-based graph clustering[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2013:1170-1176.
[24]RIEDY J,BADER D A,MEYERHENKE H.Scalable multi-threaded community detection in social networks[C]//Procee-dings of the International Parallel and Distributed Processing Symposium Workshops & PhD Forum.2012:1619-1628.
[25]YANG S,WANG B,ZHAO H,et al.Efficient dense structure mining using MapReduce[C]//Proceedings of the International Conference on Data Mining Workshops.2009:332-337.
[26]SHI J,XUE W,WANG W,et al.Scalable community detection in massive social networks using MapReduce[J].IBM Journal of Research and Development,2013,57(3):1-12.
[27]ZHANG Y,WANG J,WANG Y,et al.Parallel community de-tection on large networks with propinquity dynamics[C]//Proceedings of the International Conference on Knowledge Disco-very and Data Mining.2009:997-1006.
[28]WU Z,GAO G,BU Z,et al.SIMPLE:Simplifying-ensemblingframework for parallel community detection from large networks[J].Cluster Computing,2016,19(1):211-221.
[29]GONZALEZ T F.Clustering to minimize the maximum intercluster distance[J].Theoretical Computer Science,1985,38(C):293-306.
[30]ZACHARY W W.An information flow model for conflict andfission in small groups[J].Journal ofAnthropological Research,1977,33(4):452-473.
[31]LUSSEAU D,SCHNEIDER K,BOISSEAU O J,et al.The bottlenose dolphin community of doubtful sound features a large proportion of long-lasting associations[J].Behavioral Ecology and Sociobiology,2003,54(4):396-405.
[32]GIRVAN M,NEWMAN M E J.Community structure in social and biological networks[J].Proceedings of the National Academy of Sciences,2002,99(12):7821-7826.
[33]LESKOVEC J,MCAULEY J J.Learning to discover social circles in ego networks[J].Advances in Neural Information Processing Systems,2012,1(1):539-547.
[34]ROZEMBERCZKI B,ALLEN C,SARKAR R.Multi-scale attributed node embedding[J].Journal of Complex Networks,2021,9(2):1-22.
[35]YANG J,LESKOVEC J.Defining and evaluating network communities based on ground-truth[J].Knowledge and Information Systems,2015,42(1):181-213.
[36]CHO E,MYERS S A,LESKOVEC J.Friendship and mobility[C]//Proceedings of the International Conference on Knowledge Discovery and Data Mining.2011:1082-1090.
[37]LESKOVEC J,KLEINBERG J,FALOUTSOS C.Graph evolu-tion:Densification and shrinking diameters[J].ACM Transactions on Knowledge Discovery from Data,2007,1(1):2-41.
[38]WANG H,DAI H,CHEN S,et al.Filterdata structures:A survey[J].Computer Science,2024,51(1):35-40.
[39]WANG H,DAI H,LI M,et al.Bamboo filters:Make resizing smooth[C]//Proceedings of International Conference on Data Engineering.2022:979-991.
[40]LI M,CHEN D,DAI H,et al.Seesaw counting filter:An efficient guardian for vulnerable negative keys during dynamic filtering[C]//Proceedings of the ACM Web Conference.2022:2759-2767.
[41]WANG H,DAI H,GU R,et al.Wormhole filters:Caching your hash on persistent memory[C]//Proceedings of European Conference on Computer Systems.2024:1-16.
[42]DAI H,YU J,LI M,et al.Bloom filter with noisy coding framework for multi-set membership testing[J].IEEE Transactions on Knowledge and Data Engineering,2022,35(7):6710-6724.
[43]DAI H,LI M,LIU A X,et al.Finding persistent items in distributed datasets[J].IEEE/ACM Transactions on Networking,2020,28(1):1-14.
[44]DAI H,SHAHZAD M,LIU A X,et al.Identifying and estimating persistent items in data streams[J].IEEE/ACM Transactions on Networking,2018,26(6):2429-2442.
[45]XIA R,DAI H,DU Z,et al.Mining robust frequent items in data streams[C]//Proceedings of International Conference on Joint Cloud Computing.2020:110-117.
[46]LIU J,DAI H,XIA R,et al.DUET:A generic framework forfinding special quadratic elements in data streams[C]//Procee-dings of ACM Web Conference.2022:2989-2997.
[47]LI M,DAI H,WANG X,et al.Thresholded monitoring in distributed data streams[C]//Proceedings of International Confe-rence on Distributed Computing Systems.2019:218-227.
[48]DAI H,SHAHZAD M,LIU A X,et al.Finding persistent items in data streams[J].Proceedings of the VLDB Endowment,2016,10(4):289-300.
[49]FU P,LUO L,GUO D,et al.Jump filter:Dynamic sketch design for big data governance[J].Journal of Software,2023,34(3):1193-1212.
[50]XIE K,WEN J,ZHANG D,et al.Bloom filter query algorithm[J].Journal of Software,2009,20(1):96-108.
[51]YU S,LI X,WANG H,et al.C_CART:An instance confidence-based decision tree algorithm for classification[J].Intelligent Data Analysis,2021,25(4):929-948.
[52]YU S,LI X,WANG H,et al.BIDI:A classification algorithm with instance difficulty invariance[J].Expert Systems with Applications,2021,165(1):1-13.
[53]DAI H,LI M,WANG H,et al.Distantly supervised entity lin-king with selection consistency constraint[C]//Proceedings of International Conference on Database Systems for Advanced Applications.2023:784-799.
[54]DAI M,XU S,SHAO S,et al.Blockchainbased reliable fog cloud service solution for IIoT[J].Chinese Journal of Electronics,2021,30(2):359-366.
[55]Li P,YU X,XU H,et al.Secure localization technology based on dynamic trust management in wireless sensor networks[J].Chinese Journal of Electronics,2021,30(4):759-768.
[56]LI L,WU J,HU H,et al.Secure cloud architecture for 5G core network[J].Chinese Journal of Electronics,2021,30(3):516-522.
[57]DAI H,WANG H,CHEN Z,et al.Variable-length encodingframework:A generic framework for enhancing the accuracy of approximate membership queries[C]//Proceedings of the 23rd International Conference on Data Mining.2023:61-70.
[58]LI M,WANG H,DAI H,et al.A survey of multi-dimensional indexes:Past and future trends[J].IEEE Transactions on Knowledge and Data Engineering,2024,36(1):1-20.
[59]GU R,LI S,DAI H,et al.Adaptive online cache capacity optimization via lightweight working set size estimation at scale[C]//Proceedings of USENIX Annual Technical Conference.2023:467-484.
[60]DAI H,ZHONG Y,LIU A,et al.Noisy bloom filters for multi-set membership testing[C]//Proceedings of International Conference on Measurement and Modeling of Computer Science.2016:139-151.
[61]LI Y,WANG Z,YANG R,et al.Learned bloom filter for multi-key membership testing[C]//Proceedings of the International Conference on Database Systems for Advanced Applications.2023:62-79.
[62]DAYAN N,BERCEA I,REVIRIEGO P,et al.InfiniFilter:Expanding filters to infinity and beyond[J].Proceedings of the ACM on Management of Data,2023,1(2):1-27.
[63]PANDEY P,CONWAY A,DURIE J,et al.Vector quotient filters:Overcoming the time/space trade-off in filter design[C]//Proceedings of the International Conference on Management of Data.2021:1386-1399.
[64]PANDEY P,BENDER M A,JOHNSON R,et al.A general-purpose counting filter:Making every bit count[C]//Proceedings of the International Conference on Management of Data.2017:775-787.
[65]HUANG K,YANG T.Tagged cuckoo filters[C]//Proceedings of the International Conference on Computer Communications and Networks.2021:1-10.
[66]ZENG Z,XIAO R,LIN X,et al.Double locality sensitive ha-shing Bloom filter for high-dimensional streaming anomaly detection[J].Information Processing and Management,2023,60(3):1-18.
[67]QU Y,SUN H,DONG C,et al.Elastic collaborative edge intelligence for UAV swarm:Architecture,challenges,and opportunities[J].IEEE Communications Magazine,2024,62(1):62-68.
[68]HE S,SHI K,LIU C,et al.Collaborative sensing in internet of things:A comprehensive survey[J].IEEE Communications Surveys & Tutorials,2022,24(3):1435-1474.
[69]DENG X,JIANG Y,YANG L T,et al.Data fusion based cove-rage optimization in heterogeneous sensor networks:A survey[J].Information Fusion,2019,52(1):90-105.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!