计算机科学 ›› 2019, Vol. 46 ›› Issue (4): 28-35.doi: 10.11896/j.issn.1002-137X.2019.04.005

• 大数据与数据科学 • 上一篇    下一篇

面向大规模图数据的分布式子图匹配算法

许文, 宋文爱, 富丽贞, 吕伟   

  1. 中北大学软件学院 太原030051
  • 收稿日期:2018-09-27 出版日期:2019-04-15 发布日期:2019-04-23
  • 通讯作者: 宋文爱(1964-),女,博士后,教授,主要研究方向为云计算与大数据、图形图像处理,E-mail:188425686@qq.com(通信作者)
  • 作者简介:许 文(1993-),男,硕士生,主要研究方向为云计算与大数据,E-mail:18735181961@163.com;富丽贞(1982-),女,博士生,主要研究方向为云计算与大数据、图数据管理;吕 伟(1992-),男,硕士生,主要研究方向为云计算与大数据。
  • 基金资助:
    本文受国家自然科学基金(61602427),山西省自然科学基金(201601D202037)资助。

Distributed Subgraph Matching Algorithm for Large Scale Graph Data

XU Wen, SONG Wen-ai, FU Li-zhen, LV Wei   

  1. School of Software,North University of China,Taiyuan 030051,China
  • Received:2018-09-27 Online:2019-04-15 Published:2019-04-23

摘要: 图数据规模的爆发式增长使在单机上的子图匹配变得较为困难。尽管现有的分布式算法可以在一定程度上解决大规模图数据的子图匹配问题,但分布式环境中的网络通信代价仍然影响着算法的性能。为此,文中提出了DSsearch分布式子图匹配算法,包含查询图拆分、数据图预处理、候选顶点过滤、中间结果合并 4个步骤。其中,在数据图预处理步骤中使用图划分和完善邻居顶点策略来降低匹配过程中分布式计算节点之间的通信代价;在过滤候选顶点阶段设计DSgraph存储结构存储候选顶点,通过推迟笛卡尔积来减少冗余的中间结果。最后设计了对比实验并在具有7个计算节点的Spark分布式集群上使用真实数据集进行验证。实验结果表明,DSsearch算法能够在秒级时间内完成对百万规模顶点的数据图的子图匹配,尤其是在处理复杂查询图和稠密数据图方面更高效。数据图预处理策略的实验结果说明了通过顶点复制来降低分布式环境中网络通信代价这一策略的可行性。相比TwinTwigJoin、PSgL等算法,随着查询图顶点数量的增加,DSsearch算法的运行时间增长得更缓慢,当查询图顶点数量达到14时,其运行时间是TwinTwigJoin和PSgL算法的一半。实验数据充分说明,分布式环境中的网络通信代价和中间结果数量是影响分布式子图匹配算法的主要因素。实现数据图的预处理和推迟笛卡尔积解决了分布式子图匹配的性能瓶颈问题,有效地完成了大规模图数据的子图匹配。

关键词: 分布式, 图划分, 图数据, 子图查询, 子图匹配

Abstract: The explosive growth of graph data size makes it difficult to process subgraph matching on single machine.Although the existing distributed algorithms could solve the problem to some extent,the network communication cost among the distributed computing nodes still affects the performance of algorithm.To solve this problem,this paper proposed a distributed subgraph matching algorithm,named DSsearch.It includes four steps:spliting query graph,preprocessing data graph,filtering candidate and combining intermediate results.In the data graph preprocessing,the strategy of graph partitioning and perfect neighbor vertex are used to reduce the communication cost among the distributed computing nodes.The DSgraph storage structure is designed to store candidate vertices during the filtering candidate vertex stage,and the redundant intermediate results are reduced by delaying the Cartesian product.Finally,a comparative experiment was designed and real data sets verification was performed on a Spark distributed cluster with 7 compute nodes.The experimental results show that the DSsearch algorithm can complete subgraph matching of data graphs of millions of vertices in seconds,especially in dealing with complex query graphs and dense data graphs.The experimental results of the data graph preprocessing strategy illustrate the feasibility of reducing network communication cost in a distributed environment through vertex replication.Compared with other algorithms such as TwinTwigJoin and PSgL,the increase of running time of DSsearch algorithm becomes slowly as the number of vertices in the query graph increases.When the number of query graph vertices reaches 14,the running time is only half of that of TwinTwigJoin and PSgL algorithms.The experimental data fully demonstrates that the data transmission cost and the number of intermediate results in the distributed system are the main factors affecting the efficiency of distributed subgraph matching algorithm.Realization of preprocessing the data graph and delaying the Cartesian product solves the bottleneck problem of the performance of distributed subgraph matching and effectively completes the subgraph matching of large-scale graph data.

Key words: Distributed, Graph data, Graph partition, Subgraph matching, Subgraph query

中图分类号: 

  • TP301
[1]ULLMANN J R.An algorithm for subgraph isomorphism[J].Journal of the ACM,1976,23(1):31-42.
[2]CORDELLA L P,FOGGIA P,SANSONE C,et al.A (sub)
graph isomorphism algorithm for matching large graphs[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2004,26(10):1367-1372.
[3]LEE J,HAN W S,KASPEROVICS R,et al.An in-depth comparison of subgraph isomorphism algorithms in graph databases[J].Proceedings of the VLDB Endowment,2012,6(2):133-144.
[4]HAN W S,LEE J,LEE J H.Turbo iso:towards ultrafast and
robust subgraph isomorphism search in large graph databases[C]∥Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data.ACM,2013:337-348.
[5]BI F,CHANG L,LIN X,et al.Efficient subgraph matching by postponing cartesian products[C]∥Proceedings of the 2016 International Conference on Management of Data.ACM,2016:1199-1214.
[6]GIUGNO R,SHASHA D.Graphgrep:A fast and universal
method for querying graphs[C]∥16th International Conference on Pattern Recognition.IEEE,2002:112-115.
[7]ZOU L,CHEN L,YU J X,et al.A novel spectral coding in a large graph database[C]∥Proceedings of the 11th International Conference on Extending Database Technology:Advances in database technology.ACM,2008:181-192.
[8]HE H,SINGH A K.Closure-tree:An index structure for graph queries[C]∥Proceedings of the 22nd International Conference on Data Engineering(ICDE’06).IEEE,2006:38-38.
[9]SUN Z,WANG H,WANG H,et al.Efficient subgraph matc-
hing on billion node graphs[J].Proceedings of the VLDB Endowment,2012,5(9):788-799.
[10]SHAO B,WANG H,LI Y.Trinity:A distributed graph engine on a memory cloud[C]∥Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data.ACM,2013:505-516.
[11]ZHAO Z,WANG G,BUTT A R,et al.Sahad:Subgraph analysis in massive networks using hadoop[C]∥2012 IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS).IEEE,2012:390-401.
[12]ALON N,DAO P,HAJIRASOULIHA I,et al.Biomolecular
network motif counting and discovery by color coding[J].Bioinformatics,2008,24(13):i241-i249.
[13]DEAN J,GHEMAWAT S.MapReduce:simplified data proces-
sing on large clusters[J].Communications of the ACM,2008,51(1):107-113.
[14]LAI L,QIN L,LIN X,et al.Scalable subgraph enumeration in mapreduce[J].Proceedings of the VLDB Endowment,2015,8(10):974-985.
[15]SHAO Y,CUI B,CHEN L,et al.Parallel subgraph listing in a large-scale graph[C]∥Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data.ACM,2014:625-636.
[16]REZA T,KLYMKO C,RIPEANU M,et al.Towards Practical and Robust Labeled Pattern Matching in Trillion-Edge Graphs[C]∥2017 IEEE International Conference on Cluster Computing (CLUSTER).IEEE,2017:1-12.
[17]SUO B,LI Z,PAN W.Parallel subgraph matching on massive graphs[C]∥International Congress on Image and Signal Processing,BioMedical Engineering and Informatics (CISP-BMEI).IEEE,2016:1932-1937.
[18]BENLIC U,HAO J K.A multilevel memetic approach for improving graph k-partitions[J].IEEE Transactions on Evolutiona-ry Computation,2011,15(5):624-642.
[19]ZAHARIA M,CHOWDHURY M,FRANKLIN M J,et al.
Spark:Cluster computing with working sets[J].HotCloud,2010,10(10-10):95.
[20]LIU X,ZHOU Y,GUAN X,et al.A feasible graph partition framework for parallel computing of big graph[J].Knowledge-Based Systems,2017,134:228-239.
[21]KARYPIS G,KUMAR V.A fast and high quality multilevel
scheme for partitioning irregular graphs[J].SIAM Journal on Scientific Computing,1998,20(1):359-392.
[22]KERNIGHAN B W,LIN S.An efficient heuristic procedure for partitioning graphs[J].The Bell System Technical Journal,1970,49(2):291-307.
[1] 鲁晨阳, 邓苏, 马武彬, 吴亚辉, 周浩浩.
基于分层抽样优化的面向异构客户端的联邦学习
Federated Learning Based on Stratified Sampling Optimization for Heterogeneous Clients
计算机科学, 2022, 49(9): 183-193. https://doi.org/10.11896/jsjkx.220500263
[2] 傅丽玉, 陆歌皓, 吴义明, 罗娅玲.
区块链技术的研究及其发展综述
Overview of Research and Development of Blockchain Technology
计算机科学, 2022, 49(6A): 447-461. https://doi.org/10.11896/jsjkx.210600214
[3] 杨亚红, 王海瑞.
基于Renyi熵和BiGRU算法实现SDN环境下的DDoS攻击检测方法
DDoS Attack Detection Method in SDN Environment Based on Renyi Entropy and BiGRU Algorithm
计算机科学, 2022, 49(6A): 555-561. https://doi.org/10.11896/jsjkx.210800095
[4] 孙浩, 毛瀚宇, 张岩峰, 于戈, 徐石成, 何光宇.
区块链跨链技术发展及应用
Development and Application of Blockchain Cross-chain Technology
计算机科学, 2022, 49(5): 287-295. https://doi.org/10.11896/jsjkx.210800132
[5] 冯了了, 丁滟, 刘坤林, 马科林, 常俊胜.
区块链BFT共识算法研究进展
Research Advance on BFT Consensus Algorithms
计算机科学, 2022, 49(4): 329-339. https://doi.org/10.11896/jsjkx.210700011
[6] 梁静茹, 鄂海红, 宋美娜.
基于属性图模型的领域知识图谱构建方法
Method of Domain Knowledge Graph Construction Based on Property Graph Model
计算机科学, 2022, 49(2): 174-181. https://doi.org/10.11896/jsjkx.210500076
[7] 谭双杰, 林宝军, 刘迎春, 赵帅.
基于机器学习的分布式星载RTs系统负载调度算法
Load Scheduling Algorithm for Distributed On-board RTs System Based on Machine Learning
计算机科学, 2022, 49(2): 336-341. https://doi.org/10.11896/jsjkx.201200126
[8] 王如斌, 李瑞远, 何华均, 刘通, 李天瑞.
面向海量空间数据的分布式距离连接算法
Distributed Distance Join Algorithm for Massive Spatial Data
计算机科学, 2022, 49(1): 95-100. https://doi.org/10.11896/jsjkx.210100060
[9] 张杰, 岳韶华, 王刚, 刘家义, 姚小强.
基于Stackelberg与边拉普拉斯矩阵的多智能体系统
Multi-agent System Based on Stackelberg and Edge Laplace Matrix
计算机科学, 2021, 48(8): 253-262. https://doi.org/10.11896/jsjkx.200700032
[10] 唐飞, 陈云龙, 冯卓.
基于区块链和代理重加密的电子处方共享方案
Electronic Prescription Sharing Scheme Based on Blockchain and Proxy Re-encryption
计算机科学, 2021, 48(6A): 498-503. https://doi.org/10.11896/jsjkx.201000143
[11] 钱甜甜, 张帆.
基于分布式边缘计算的情绪识别系统
Emotion Recognition System Based on Distributed Edge Computing
计算机科学, 2021, 48(6A): 638-643. https://doi.org/10.11896/jsjkx.201000010
[12] 黄梅根, 刘川, 杜欢, 刘佳乐.
基于知识图谱的认知诊断模型及其在教辅中的应用研究
Research on Cognitive Diagnosis Model Based on Knowledge Graph and Its Application in Teaching Assistant
计算机科学, 2021, 48(6A): 644-648. https://doi.org/10.11896/jsjkx.200700163
[13] 高枫越, 王琰, 朱铁兰.
有适应力的分布式状态估计方法
Resilient Distributed State Estimation Algorithm
计算机科学, 2021, 48(5): 308-312. https://doi.org/10.11896/jsjkx.200300117
[14] 张航, 唐聃, 蔡红亮.
分布式存储系统中的预测式纠删码研究
Study on Predictive Erasure Codes in Distributed Storage System
计算机科学, 2021, 48(5): 130-139. https://doi.org/10.11896/jsjkx.200300124
[15] 丁诗铭, 王天荆, 沈航, 白光伟.
基于能量分类器的抗SSDF攻击协作频谱感知算法
Energy Classifier Based Cooperative Spectrum Sensing Algorithm for Anti-SSDF Attack
计算机科学, 2021, 48(2): 282-288. https://doi.org/10.11896/jsjkx.191100124
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!