Computer Science ›› 2019, Vol. 46 ›› Issue (4): 28-35.doi: 10.11896/j.issn.1002-137X.2019.04.005

• Big Data & Data Science • Previous Articles     Next Articles

Distributed Subgraph Matching Algorithm for Large Scale Graph Data

XU Wen, SONG Wen-ai, FU Li-zhen, LV Wei   

  1. School of Software,North University of China,Taiyuan 030051,China
  • Received:2018-09-27 Online:2019-04-15 Published:2019-04-23

Abstract: The explosive growth of graph data size makes it difficult to process subgraph matching on single machine.Although the existing distributed algorithms could solve the problem to some extent,the network communication cost among the distributed computing nodes still affects the performance of algorithm.To solve this problem,this paper proposed a distributed subgraph matching algorithm,named DSsearch.It includes four steps:spliting query graph,preprocessing data graph,filtering candidate and combining intermediate results.In the data graph preprocessing,the strategy of graph partitioning and perfect neighbor vertex are used to reduce the communication cost among the distributed computing nodes.The DSgraph storage structure is designed to store candidate vertices during the filtering candidate vertex stage,and the redundant intermediate results are reduced by delaying the Cartesian product.Finally,a comparative experiment was designed and real data sets verification was performed on a Spark distributed cluster with 7 compute nodes.The experimental results show that the DSsearch algorithm can complete subgraph matching of data graphs of millions of vertices in seconds,especially in dealing with complex query graphs and dense data graphs.The experimental results of the data graph preprocessing strategy illustrate the feasibility of reducing network communication cost in a distributed environment through vertex replication.Compared with other algorithms such as TwinTwigJoin and PSgL,the increase of running time of DSsearch algorithm becomes slowly as the number of vertices in the query graph increases.When the number of query graph vertices reaches 14,the running time is only half of that of TwinTwigJoin and PSgL algorithms.The experimental data fully demonstrates that the data transmission cost and the number of intermediate results in the distributed system are the main factors affecting the efficiency of distributed subgraph matching algorithm.Realization of preprocessing the data graph and delaying the Cartesian product solves the bottleneck problem of the performance of distributed subgraph matching and effectively completes the subgraph matching of large-scale graph data.

Key words: Subgraph matching, Subgraph query, Distributed, Graph data, Graph partition

CLC Number: 

  • TP301
[1]ULLMANN J R.An algorithm for subgraph isomorphism[J].Journal of the ACM,1976,23(1):31-42.
graph isomorphism algorithm for matching large graphs[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2004,26(10):1367-1372.
[3]LEE J,HAN W S,KASPEROVICS R,et al.An in-depth comparison of subgraph isomorphism algorithms in graph databases[J].Proceedings of the VLDB Endowment,2012,6(2):133-144.
[4]HAN W S,LEE J,LEE J H.Turbo iso:towards ultrafast and
robust subgraph isomorphism search in large graph databases[C]∥Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data.ACM,2013:337-348.
[5]BI F,CHANG L,LIN X,et al.Efficient subgraph matching by postponing cartesian products[C]∥Proceedings of the 2016 International Conference on Management of Data.ACM,2016:1199-1214.
[6]GIUGNO R,SHASHA D.Graphgrep:A fast and universal
method for querying graphs[C]∥16th International Conference on Pattern Recognition.IEEE,2002:112-115.
[7]ZOU L,CHEN L,YU J X,et al.A novel spectral coding in a large graph database[C]∥Proceedings of the 11th International Conference on Extending Database Technology:Advances in database technology.ACM,2008:181-192.
[8]HE H,SINGH A K.Closure-tree:An index structure for graph queries[C]∥Proceedings of the 22nd International Conference on Data Engineering(ICDE’06).IEEE,2006:38-38.
[9]SUN Z,WANG H,WANG H,et al.Efficient subgraph matc-
hing on billion node graphs[J].Proceedings of the VLDB Endowment,2012,5(9):788-799.
[10]SHAO B,WANG H,LI Y.Trinity:A distributed graph engine on a memory cloud[C]∥Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data.ACM,2013:505-516.
[11]ZHAO Z,WANG G,BUTT A R,et al.Sahad:Subgraph analysis in massive networks using hadoop[C]∥2012 IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS).IEEE,2012:390-401.
[12]ALON N,DAO P,HAJIRASOULIHA I,et al.Biomolecular
network motif counting and discovery by color coding[J].Bioinformatics,2008,24(13):i241-i249.
[13]DEAN J,GHEMAWAT S.MapReduce:simplified data proces-
sing on large clusters[J].Communications of the ACM,2008,51(1):107-113.
[14]LAI L,QIN L,LIN X,et al.Scalable subgraph enumeration in mapreduce[J].Proceedings of the VLDB Endowment,2015,8(10):974-985.
[15]SHAO Y,CUI B,CHEN L,et al.Parallel subgraph listing in a large-scale graph[C]∥Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data.ACM,2014:625-636.
[16]REZA T,KLYMKO C,RIPEANU M,et al.Towards Practical and Robust Labeled Pattern Matching in Trillion-Edge Graphs[C]∥2017 IEEE International Conference on Cluster Computing (CLUSTER).IEEE,2017:1-12.
[17]SUO B,LI Z,PAN W.Parallel subgraph matching on massive graphs[C]∥International Congress on Image and Signal Processing,BioMedical Engineering and Informatics (CISP-BMEI).IEEE,2016:1932-1937.
[18]BENLIC U,HAO J K.A multilevel memetic approach for improving graph k-partitions[J].IEEE Transactions on Evolutiona-ry Computation,2011,15(5):624-642.
Spark:Cluster computing with working sets[J].HotCloud,2010,10(10-10):95.
[20]LIU X,ZHOU Y,GUAN X,et al.A feasible graph partition framework for parallel computing of big graph[J].Knowledge-Based Systems,2017,134:228-239.
[21]KARYPIS G,KUMAR V.A fast and high quality multilevel
scheme for partitioning irregular graphs[J].SIAM Journal on Scientific Computing,1998,20(1):359-392.
[22]KERNIGHAN B W,LIN S.An efficient heuristic procedure for partitioning graphs[J].The Bell System Technical Journal,1970,49(2):291-307.
[1] YANG Li-peng, ZHANG Yang-sen, ZHANG Wen, WANG Jian, ZENG Jian-rong. Web Log Analysis Method Based on Storm Real-time Streaming Computing Framework [J]. Computer Science, 2019, 46(9): 176-183.
[2] ZHONG Feng-yan, WANG Yan, LI Nian-shuang. Node Selection Scheme for Data Repair in Heterogeneous Distributed Storage Systems [J]. Computer Science, 2019, 46(8): 35-41.
[3] LI Bo-jia, ZHANG Yang-sen, CHEN Ruo-yu. Method for Generating Massive Data with Assignable Distribution [J]. Computer Science, 2019, 46(8): 56-63.
[4] WANG Ya-hui, LIU Bo, YUAN Xiao-tong. Distributed Convolutional Neural Networks Based on Approximate Newton-type Mothod [J]. Computer Science, 2019, 46(7): 180-185.
[5] WU Can, WANG Xiao-ning, XIAO Hai-li, CAO Rong-qiang, ZHAO Yi-ning, CHI Xue-bin. Survey on Distributed Message System [J]. Computer Science, 2019, 46(6A): 1-5.
[6] XU Zhe, LIU Liang, QIN Xiao-lin, QIN Wei-meng. Distributed Spatial Keyword Query Processing Algorithm with Relational Attributes [J]. Computer Science, 2019, 46(6A): 402-406.
[7] LI Xiao, WEI Chang-jiang. Study on Complete Requirement Acquiring Based on Tracking Matrix [J]. Computer Science, 2019, 46(6): 189-195.
[8] HUANG Jian-yi, LI Jian-jiang, WANG Zheng, FANG Ming-zhe. Single-Pass Short Text Clustering Based on Context Similarity Matrix [J]. Computer Science, 2019, 46(4): 50-56.
[9] SHU Na,LIU Bo,LIN Wei-wei,LI Peng-fei. Survey of Distributed Machine Learning Platforms and Algorithms [J]. Computer Science, 2019, 46(3): 9-18.
[10] LI De-quan, DONG Qiao, ZHOU Yue-jin. Distributed Online Conditional Gradient Optimization Algorithm [J]. Computer Science, 2019, 46(3): 332-337.
[11] ZHAO Ping, SHOU Li-dan, CHEN Ke, CHEN Gang, WU Xiao-fan. Storage and Query Model for Localized Search on Temporal Graph Data [J]. Computer Science, 2019, 46(10): 186-194.
[12] DU Xiu-li, HU Xing, CHEN Bo, QIU Shao-ming. Multi-hypothesis Reconstruction Algorithm of DCVS Based on Weighted Non-local Similarity [J]. Computer Science, 2019, 46(1): 291-296.
[13] LI Yan, MA Jun-ming, AN Bo, CAO Dong-gang. Web Based Lightweight Tool for Big Data Processing and Visualization [J]. Computer Science, 2018, 45(9): 60-64, 93.
[14] HE Si-yuan, OU Bo, LIAO Xin. Role Matching Access Control Model for Distributed Workflow [J]. Computer Science, 2018, 45(7): 129-134.
[15] LI Peng-yuan,ZHANG Zhi-yong. Design of Storage Platform for Large Scale Data Based on SWIFT System [J]. Computer Science, 2018, 45(6A): 601-605.
Full text



[1] XIA Qing-xun and ZHUANG Yi. Remote Attestation Mechanism Based on Locality Principle[J]. Computer Science, 2018, 45(4): 148 -151, 162 .
[2] LI Bai-shen, LI Ling-zhi, SUN Yong and ZHU Yan-qin. Intranet Defense Algorithm Based on Pseudo Boosting Decision Tree[J]. Computer Science, 2018, 45(4): 157 -162 .
[3] ZHANG Jia-nan and XIAO Ming-yu. Approximation Algorithm for Weighted Mixed Domination Problem[J]. Computer Science, 2018, 45(4): 83 -88 .
[4] WU Jian-hui, HUANG Zhong-xiang, LI Wu, WU Jian-hui, PENG Xin and ZHANG Sheng. Robustness Optimization of Sequence Decision in Urban Road Construction[J]. Computer Science, 2018, 45(4): 89 -93 .
[5] LIU Qin. Study on Data Quality Based on Constraint in Computer Forensics[J]. Computer Science, 2018, 45(4): 169 -172 .
[6] SHI Wen-jun, WU Ji-gang and LUO Yu-chun. Fast and Efficient Scheduling Algorithms for Mobile Cloud Offloading[J]. Computer Science, 2018, 45(4): 94 -99, 116 .
[7] ZHOU Yan-ping and YE Qiao-lin. L1-norm Distance Based Least Squares Twin Support Vector Machine[J]. Computer Science, 2018, 45(4): 100 -105, 130 .
[8] WANG Zhen-wu, LV Xiao-hua and HAN Xiao-hui. Survey of Terrain LOD Technology Based on Quadtree Segmentation[J]. Computer Science, 2018, 45(4): 34 -45 .
[9] WANG Zhen-chao, HOU Huan-huan and LIAN Rui. Path Optimization Scheme for Restraining Degree of Disorder in CMT[J]. Computer Science, 2018, 45(4): 122 -125 .
[10] LI Shan and RAO Wen-bi. Video-based Detection of Human Motion Area in Mine[J]. Computer Science, 2018, 45(4): 291 -295 .