基于Twitter Storm平台并行挖掘最稠密子图

计算机科学 ›› 2014, Vol. 41 ›› Issue (1): 274-278.

基于Twitter Storm平台并行挖掘最稠密子图

王金明,王远方

东南大学计算机科学与工程学院南京211189;东南大学计算机科学与工程学院南京211189

出版日期:2018-11-14 发布日期:2018-11-14

Parallel Mining of Densest Subgraph Based on Twitter Storm

WANG Jin-ming and WANG Yuan-fang

Online:2018-11-14 Published:2018-11-14

摘要/Abstract

摘要： 在大规模图结构数据中发现最稠密子图具有极其广泛的应用,如社区发现、垃圾邮件检测和论文引用关系抽取等。基于带标签的无向图,提出了查询标签集的概念,设计了一个可以快速发现最稠密子图的近似算法DSFLC(Densest Subgraph Finding based on Labelset Constraint ):用户提交自定义的查询标签集,算法便可保证在用户可以接受的时间内返回满足查询标签集约束的最稠密子图。对于任何参数ε(ε＞0),DSFLC算法只需扫描大规模数据集O(log1+εn)次,同时可保证算法的近似因子是2(1+ε)。对DSFLC算法进行分析后,发现该算法在预处理阶段易于并行化,因此选择Twitter Storm平台,并行化地实现了DSFLC算法。最后对从DBLP数据库中抽取的合作关系图进行测试,一方面研究Storm平台对算法的加速程度；另一方面分析挖掘出的子图的稠密度与参数ε之间的关系,最终验证了DSFLC算法的实用性和可扩展性。

关键词: 最稠密子图发现,查询标签集,DSFLC算法,Twitter Storm平台

Abstract: In large scale graph,finding densest subgraph has a wide range of applications,such as community discovery,spam detection and reference relation extraction．Based on tagged undirected graph,we introduced the concept of QLS and designed an approximation algorithm DSFLC which can quickly find the densest subgraph:users submit a QLS and the algorithm will return the densest subgraph under QLS within the time that user can accept．For any ε＞0,DSFLC only needs to scan large-scale data sets O(log1+εn)times,and can ensure the approximation algorithm is a 2(1＋ε)－approximation algorithm．After analyzing DSFLC,we found this algorithm is easy to parallelize,so we chose Twitter Storm platform to parallel DSFLC algorithm．Finally,the test data sets extracted from the DBLP database verify DSFLC’ practicality and scalability.

Key words: Densest subgraph finding,QLS,DSFLC algorithm,Twitter storm platform

王金明,王远方. 基于Twitter Storm平台并行挖掘最稠密子图[J]. 计算机科学, 2014, 41(1): 274-278. https://doi.org/

WANG Jin-ming and WANG Yuan-fang. Parallel Mining of Densest Subgraph Based on Twitter Storm[J]. Computer Science, 2014, 41(1): 274-278. https://doi.org/

参考文献

[1] Yon D,Filippo G,Marco P．Extraction and classification of dense communities in the Web[C]∥WWW 2007．2007:461-470
[2] Newman M E J．Modularity and community structure in net-works[C]∥PNAS 2006．2006:8577-8582
[3] William F G,Steve L,Lee G C．Efficient identification of Web communities[C]∥KDD 2000．2000:150-160
[4] https://github.com/nathanmarz/storm
[5] http://www.dblp.org/db/
[6] Goldberg A V．Finding a maximum density Subgraph[R]．UCB/CSD-84-171．EECS Department,University of California,Berkeley,1984
[7] Charikar M.Greedy approximation algorithms for finding dense components in a graph[C]∥APPROX 2000．2000:84-90
[8] Lawler F．Combinatorial Optimization:Networks and Matroids[M]．Holt,Rinehart,and Winston,1976
[9] Gibson D,Kumar R,Tomkins A．Discovering large dense subgraphs in massive graphs[C]∥VLDB 2005．2005:721-732
[10] Bahmani B,umar R．Sergei Vassilvitskii:Densest Subgraph inStreaming and MapReduce[C]∥VLDB 2012．2012:454-465
[11] Silva A,Meira W Jr,Zaki M J．Mining Attribute-structure Correlated Patterns in Large Attributed Graphs[C]∥VLDB．2012:466-477
[12] Jeffrey D,Sanjay G．MapReduce:simplified data processing onlarge clusters[C]∥ACM 2008．2008:107-113
[13] Bu Ying-yi,Howe B,Balazinska M,et al.HaLoop:Efficient Iterative Data Processing on Large Clusters[C]∥VLDB 2010．2010:285-296
[14] Abouzeid A,Bajda-Pawlikowski K,Abadi D J,et al.HadoopDB:An architectural hybrid of MapReduce and DBMS technologies for analytical workloads[C]∥VLDB 2009．2009:922-933
[15] Condie T,Conway N,Alvaro P,et al.MapReduce Online[C]∥NSDI’10Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation.USENIX Association Berkeley,CA,USA,2010:21

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed