%A XU Wen, SONG Wen-ai, FU Li-zhen, LV Wei %T Distributed Subgraph Matching Algorithm for Large Scale Graph Data %0 Journal Article %D 2019 %J Computer Science %R 10.11896/j.issn.1002-137X.2019.04.005 %P 28-35 %V 46 %N 4 %U {https://www.jsjkx.com/CN/abstract/article_18105.shtml} %8 2019-04-15 %X The explosive growth of graph data size makes it difficult to process subgraph matching on single machine.Although the existing distributed algorithms could solve the problem to some extent,the network communication cost among the distributed computing nodes still affects the performance of algorithm.To solve this problem,this paper proposed a distributed subgraph matching algorithm,named DSsearch.It includes four steps:spliting query graph,preprocessing data graph,filtering candidate and combining intermediate results.In the data graph preprocessing,the strategy of graph partitioning and perfect neighbor vertex are used to reduce the communication cost among the distributed computing nodes.The DSgraph storage structure is designed to store candidate vertices during the filtering candidate vertex stage,and the redundant intermediate results are reduced by delaying the Cartesian product.Finally,a comparative experiment was designed and real data sets verification was performed on a Spark distributed cluster with 7 compute nodes.The experimental results show that the DSsearch algorithm can complete subgraph matching of data graphs of millions of vertices in seconds,especially in dealing with complex query graphs and dense data graphs.The experimental results of the data graph preprocessing strategy illustrate the feasibility of reducing network communication cost in a distributed environment through vertex replication.Compared with other algorithms such as TwinTwigJoin and PSgL,the increase of running time of DSsearch algorithm becomes slowly as the number of vertices in the query graph increases.When the number of query graph vertices reaches 14,the running time is only half of that of TwinTwigJoin and PSgL algorithms.The experimental data fully demonstrates that the data transmission cost and the number of intermediate results in the distributed system are the main factors affecting the efficiency of distributed subgraph matching algorithm.Realization of preprocessing the data graph and delaying the Cartesian product solves the bottleneck problem of the performance of distributed subgraph matching and effectively completes the subgraph matching of large-scale graph data.