Computer Science ›› 2019, Vol. 46 ›› Issue (4): 28-35.doi: 10.11896/j.issn.1002-137X.2019.04.005

• Big Data & Data Science • Previous Articles     Next Articles

Distributed Subgraph Matching Algorithm for Large Scale Graph Data

XU Wen, SONG Wen-ai, FU Li-zhen, LV Wei   

  1. School of Software,North University of China,Taiyuan 030051,China
  • Received:2018-09-27 Online:2019-04-15 Published:2019-04-23

Abstract: The explosive growth of graph data size makes it difficult to process subgraph matching on single machine.Although the existing distributed algorithms could solve the problem to some extent,the network communication cost among the distributed computing nodes still affects the performance of algorithm.To solve this problem,this paper proposed a distributed subgraph matching algorithm,named DSsearch.It includes four steps:spliting query graph,preprocessing data graph,filtering candidate and combining intermediate results.In the data graph preprocessing,the strategy of graph partitioning and perfect neighbor vertex are used to reduce the communication cost among the distributed computing nodes.The DSgraph storage structure is designed to store candidate vertices during the filtering candidate vertex stage,and the redundant intermediate results are reduced by delaying the Cartesian product.Finally,a comparative experiment was designed and real data sets verification was performed on a Spark distributed cluster with 7 compute nodes.The experimental results show that the DSsearch algorithm can complete subgraph matching of data graphs of millions of vertices in seconds,especially in dealing with complex query graphs and dense data graphs.The experimental results of the data graph preprocessing strategy illustrate the feasibility of reducing network communication cost in a distributed environment through vertex replication.Compared with other algorithms such as TwinTwigJoin and PSgL,the increase of running time of DSsearch algorithm becomes slowly as the number of vertices in the query graph increases.When the number of query graph vertices reaches 14,the running time is only half of that of TwinTwigJoin and PSgL algorithms.The experimental data fully demonstrates that the data transmission cost and the number of intermediate results in the distributed system are the main factors affecting the efficiency of distributed subgraph matching algorithm.Realization of preprocessing the data graph and delaying the Cartesian product solves the bottleneck problem of the performance of distributed subgraph matching and effectively completes the subgraph matching of large-scale graph data.

Key words: Distributed, Graph data, Graph partition, Subgraph matching, Subgraph query

CLC Number: 

  • TP301
[1]ULLMANN J R.An algorithm for subgraph isomorphism[J].Journal of the ACM,1976,23(1):31-42.
[2]CORDELLA L P,FOGGIA P,SANSONE C,et al.A (sub)
graph isomorphism algorithm for matching large graphs[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2004,26(10):1367-1372.
[3]LEE J,HAN W S,KASPEROVICS R,et al.An in-depth comparison of subgraph isomorphism algorithms in graph databases[J].Proceedings of the VLDB Endowment,2012,6(2):133-144.
[4]HAN W S,LEE J,LEE J H.Turbo iso:towards ultrafast and
robust subgraph isomorphism search in large graph databases[C]∥Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data.ACM,2013:337-348.
[5]BI F,CHANG L,LIN X,et al.Efficient subgraph matching by postponing cartesian products[C]∥Proceedings of the 2016 International Conference on Management of Data.ACM,2016:1199-1214.
[6]GIUGNO R,SHASHA D.Graphgrep:A fast and universal
method for querying graphs[C]∥16th International Conference on Pattern Recognition.IEEE,2002:112-115.
[7]ZOU L,CHEN L,YU J X,et al.A novel spectral coding in a large graph database[C]∥Proceedings of the 11th International Conference on Extending Database Technology:Advances in database technology.ACM,2008:181-192.
[8]HE H,SINGH A K.Closure-tree:An index structure for graph queries[C]∥Proceedings of the 22nd International Conference on Data Engineering(ICDE’06).IEEE,2006:38-38.
[9]SUN Z,WANG H,WANG H,et al.Efficient subgraph matc-
hing on billion node graphs[J].Proceedings of the VLDB Endowment,2012,5(9):788-799.
[10]SHAO B,WANG H,LI Y.Trinity:A distributed graph engine on a memory cloud[C]∥Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data.ACM,2013:505-516.
[11]ZHAO Z,WANG G,BUTT A R,et al.Sahad:Subgraph analysis in massive networks using hadoop[C]∥2012 IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS).IEEE,2012:390-401.
[12]ALON N,DAO P,HAJIRASOULIHA I,et al.Biomolecular
network motif counting and discovery by color coding[J].Bioinformatics,2008,24(13):i241-i249.
[13]DEAN J,GHEMAWAT S.MapReduce:simplified data proces-
sing on large clusters[J].Communications of the ACM,2008,51(1):107-113.
[14]LAI L,QIN L,LIN X,et al.Scalable subgraph enumeration in mapreduce[J].Proceedings of the VLDB Endowment,2015,8(10):974-985.
[15]SHAO Y,CUI B,CHEN L,et al.Parallel subgraph listing in a large-scale graph[C]∥Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data.ACM,2014:625-636.
[16]REZA T,KLYMKO C,RIPEANU M,et al.Towards Practical and Robust Labeled Pattern Matching in Trillion-Edge Graphs[C]∥2017 IEEE International Conference on Cluster Computing (CLUSTER).IEEE,2017:1-12.
[17]SUO B,LI Z,PAN W.Parallel subgraph matching on massive graphs[C]∥International Congress on Image and Signal Processing,BioMedical Engineering and Informatics (CISP-BMEI).IEEE,2016:1932-1937.
[18]BENLIC U,HAO J K.A multilevel memetic approach for improving graph k-partitions[J].IEEE Transactions on Evolutiona-ry Computation,2011,15(5):624-642.
[19]ZAHARIA M,CHOWDHURY M,FRANKLIN M J,et al.
Spark:Cluster computing with working sets[J].HotCloud,2010,10(10-10):95.
[20]LIU X,ZHOU Y,GUAN X,et al.A feasible graph partition framework for parallel computing of big graph[J].Knowledge-Based Systems,2017,134:228-239.
[21]KARYPIS G,KUMAR V.A fast and high quality multilevel
scheme for partitioning irregular graphs[J].SIAM Journal on Scientific Computing,1998,20(1):359-392.
[22]KERNIGHAN B W,LIN S.An efficient heuristic procedure for partitioning graphs[J].The Bell System Technical Journal,1970,49(2):291-307.
[1] LU Chen-yang, DENG Su, MA Wu-bin, WU Ya-hui, ZHOU Hao-hao. Federated Learning Based on Stratified Sampling Optimization for Heterogeneous Clients [J]. Computer Science, 2022, 49(9): 183-193.
[2] FU Li-yu, LU Ge-hao, WU Yi-ming, LUO Ya-ling. Overview of Research and Development of Blockchain Technology [J]. Computer Science, 2022, 49(6A): 447-461.
[3] YANG Ya-hong, WANG Hai-rui. DDoS Attack Detection Method in SDN Environment Based on Renyi Entropy and BiGRU Algorithm [J]. Computer Science, 2022, 49(6A): 555-561.
[4] SUN Hao, MAO Han-yu, ZHANG Yan-feng, YU Ge, XU Shi-cheng, HE Guang-yu. Development and Application of Blockchain Cross-chain Technology [J]. Computer Science, 2022, 49(5): 287-295.
[5] FENG Liao-liao, DING Yan, LIU Kun-lin, MA Ke-lin, CHANG Jun-sheng. Research Advance on BFT Consensus Algorithms [J]. Computer Science, 2022, 49(4): 329-339.
[6] LIANG Jing-ru, E Hai-hong, Song Mei-na. Method of Domain Knowledge Graph Construction Based on Property Graph Model [J]. Computer Science, 2022, 49(2): 174-181.
[7] TAN Shuang-jie, LIN Bao-jun, LIU Ying-chun, ZHAO Shuai. Load Scheduling Algorithm for Distributed On-board RTs System Based on Machine Learning [J]. Computer Science, 2022, 49(2): 336-341.
[8] WANG Ru-bin, LI Rui-yuan, HE Hua-jun, LIU Tong, LI Tian-rui. Distributed Distance Join Algorithm for Massive Spatial Data [J]. Computer Science, 2022, 49(1): 95-100.
[9] ZHANG Jie, YUE Shao-hua, WANG Gang, LIU Jia-yi, YAO Xiao-qiang. Multi-agent System Based on Stackelberg and Edge Laplace Matrix [J]. Computer Science, 2021, 48(8): 253-262.
[10] TANG Fei, CHEN Yun-long, FENG Zhuo. Electronic Prescription Sharing Scheme Based on Blockchain and Proxy Re-encryption [J]. Computer Science, 2021, 48(6A): 498-503.
[11] LU Yong-chao, WANG Bin-yi, HU Jiang-feng, MU Yang, REN Jun-long. Research on Integrated Electronic Time Synchronization Technology [J]. Computer Science, 2021, 48(6A): 629-632.
[12] QIAN Tian-tian, ZHANG Fan. Emotion Recognition System Based on Distributed Edge Computing [J]. Computer Science, 2021, 48(6A): 638-643.
[13] HUANG Mei-gen, LIU Chuan, DU Huan, LIU Jia-le. Research on Cognitive Diagnosis Model Based on Knowledge Graph and Its Application in Teaching Assistant [J]. Computer Science, 2021, 48(6A): 644-648.
[14] CAO Xue-fei, NIU Qian, WANG Rui-bo, WANG Yu, LI Ji-hong. Distributed Representation Learning and Improvement of Chinese Words Based on Co-occurrence [J]. Computer Science, 2021, 48(6): 222-226.
[15] GAO Feng-yue, WANG Yan, ZHU Tie-lan. Resilient Distributed State Estimation Algorithm [J]. Computer Science, 2021, 48(5): 308-312.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!