Computer Science ›› 2022, Vol. 49 ›› Issue (1): 95-100.doi: 10.11896/jsjkx.210100060

• Database & Big Data & Data Science • Previous Articles     Next Articles

Distributed Distance Join Algorithm for Massive Spatial Data

WANG Ru-bin1,3, LI Rui-yuan2,3, HE Hua-jun1,3, LIU Tong4, LI Tian-rui1   

  1. 1 School of Information Science and Technology,Southwest Jiaotong University,Chengdu 611756,China
    2 College of Computer Science,Chongqing University,Chongqing 400044,China
    3 JD Intelligent Cities Research,Beijing 100176,China
    4 School of Computer Engineering and Science,Shanghai University,Shanghai 200444,China
  • Received:2021-01-07 Revised:2021-05-12 Online:2022-01-15 Published:2022-01-18
  • About author:WANG Ru-bin,born in 1994,postgra-duate.His main research interests include spatio-temporal data management.
    LI Rui-yuan,born in 1990,Ph.D.His main research interests include spatio-temporal data management and mining.
  • Supported by:
    National Key Research and Development Program of China(2019YFB2101801).

Abstract: Spatial distance join is one of the most common operations for spatial data analysis,which has various application scenarios.Existing distributed methods face the problems of too large space,high data skew,and slow self-join.To this end,this paper proposes a novel distributed distance join algorithm,i.e.,JUST-Join,for massive spatial data.First,JUST-Join regards only the necessary space as the global domain,which can filter invalid data out,reducing the overhead of unnecessary data transmission and computation.Second,we consider both the spatial distributions of the two datasets,which relieves the data skew issue.Third,for the spatial self-join,we adopt plane sweep method to further improve the efficiency.We implement JUST-Join algorithm based on Spark,and conduct extensive experiments using real datasets.The experimental results show that JUST-Join is superior to the state-of-the-art distributed spatial analysis systems in terms both of efficiency and scalability.

Key words: Distributed computing, Spatial distance join, Spatial indexing, Spatial partition, Spatio-temporal data

CLC Number: 

  • TP338
[1]HE H,LI R,WANG R,et al.Efficient suspected infectedcrowds detection based on spatio-temporal trajectories[J].ar-Xiv:2004.06653,2020.
[2]JACOX E H,SAMET H.Spatial join techniques[J].ACM Transactions on Database Systems (TODS),2007,32(1):7.
[3]CHEN D H,LIU L X,LE J J.Research on a Spatial Join Query with Keyword Search[J].Computer Science,2009,36(7):150-152.
[4]ELDAWY A,MOKBEL M F.Spatialhadoop:A mapreduceframework for spatial data[C]//ICDE.IEEE,2015:1352-1363.
[5]AJI A,WANG F,VO H,et al.Hadoop-GIS:A high performance spatial data warehousing system over MapReduce[J].Procee-dings of the VLDB Endowment,2013,6(11):1009-1020.
[6]YU J,WU J,SARWAT M.Geospark:A cluster computingframework for processing large-scale spatial data[C]//SIGSPATIAL.2015:1-4.
[7]TANG M,YU Y,MALLUHI Q M,et al.Locationspark:A distributed in-memory data management system for big spatial data[J].PVLDB,2016,9(13):1565-1568.
[8]YOU S,ZHANG J,GRUENWALD L.Large-scale spatial join query processing in cloud[C]//ICDE.IEEE,2015:34-41.
[9]XIE D,LI F,YAO B,et al.Simba:Efficient in-memory spatial analytics[C]//ICDE.2016:1071-1085.
[10]YANG K,DING X,ZHANG Y,et al.Distributed SimilarityQueries in Metric Spaces[J].Data Science and Engineering,2019,4(2):93-108.
[11]DEAN J,GHEMAWAT S.MapReduce:a flexible data proces-sing tool[J].Communications of the ACM,2010,53(1):72-77.
[12]ZAHARIA M,CHOWDHURY M,FRANKLIN M J,et al.Spark:Cluster computing with working sets[C]//Proceedings of 2nd USENIX Conference on Hot Topics in Cloud Computing.2010.
[13]QIAO B,HU B,ZHU J,et al.A top-k spatial join querying processing algorithm based on spark[J].Information Systems,2020,87:101419.
[14]WHITMAN R T,MARSH B G,PARK M B,et al.Distributed spatial and spatio-temporal join on apache spark[J].TSAS,2019,5(1):1-28.
[15]FINKEL R A,BENTLEY J L.Quad trees a data structure for retrieval on composite keys[J].Acta informatica,1974,4(1):1-9.
[16]GUTTMAN A.R-trees:A dynamic index structure for spatial searching[C]//SIGMOD.1984:47-57.
[17]LI R,HE H,WANG R,et al.Just:Jd urban spatio-temporal data engine[C]//ICDE.IEEE,2020:1558-1569.
[18]LI R,HE H,WANG R,et al.Trajmesa:A distributed nosqlstorage engine for big trajectory data[C]//ICDE.IEEE,2020:2002-2005.
[19]ELDAWY A,ALARABI L,MOKBEL M F.Spatial partitioning techniques in SpatialHadoop[J].Proceedings of the VLDB Endowment,2015,8(12):1602-1605.
[20]PREPARATA F P,SHAMOS M I.Computational geometry:an introduction[M].Springer Science & Business Media,2012.
[21]PANDEY V,KIPF A,NEUMANN T,et al.How good are mo-dern spatial analytics systems?[J].Proceedings of the VLDB Endowment,2018,11(11):1661-1673.
[1] LI Rong-fan, ZHONG Ting, WU Jin, ZHOU Fan, KUANG Ping. Spatio-Temporal Attention-based Kriging for Land Deformation Data Interpolation [J]. Computer Science, 2022, 49(8): 33-39.
[2] QIAN Tian-tian, ZHANG Fan. Emotion Recognition System Based on Distributed Edge Computing [J]. Computer Science, 2021, 48(6A): 638-643.
[3] YUAN Chen-yu, XIE Zai-peng, ZHU Xiao-rui, QU Zhi-hao, XU Yuan-yuan. Convolutional Optimization Algorithm Based on Distributed Coding [J]. Computer Science, 2021, 48(2): 47-54.
[4] LI Hao, WANG Fei, XIE Si-yu, KOU Yong-qi, ZHANG Lan, YANG Bing, KANG Yan. Dual Autoregressive Components Traffic Prediction Based on Improved Graph WaveNet [J]. Computer Science, 2021, 48(11A): 159-165.
[5] YOU Lan, HAN Xue-wei, HE Zheng-wei, XIAO Si-yu, HE Du, PAN Xiao-meng. Improved Sequence-to-Sequence Model for Short-term Vessel Trajectory Prediction Using AIS Data Streams [J]. Computer Science, 2020, 47(9): 169-174.
[6] SUN Tian-xu, ZHAO Yun-long, LIAN Zuo-wei, SUN Yi, CAI Yue-xiao. Mobility Pattern Mining for People Flow Based on Spatio-Temporal Data [J]. Computer Science, 2020, 47(10): 91-96.
[7] LI Bo-jia, ZHANG Yang-sen, CHEN Ruo-yu. Method for Generating Massive Data with Assignable Distribution [J]. Computer Science, 2019, 46(8): 56-63.
[8] LIU Chang-yun,YANG Yu-di,ZHOU Li-hua,ZHAO Li-hong. Discovering Popular Social Location with Time Label [J]. Computer Science, 2019, 46(7): 186-194.
[9] GUO Sheng-nan, LIN You-fang, JIN Wen-wei, WAN Huai-yu. Citywide Crowd Flows Prediction Based on Spatio-Temporal Recurrent Convolutional Networks [J]. Computer Science, 2019, 46(6A): 385-391.
[10] ZHU Kun, HUANG Rui-zhang and ZHANG Na-na. Efficient Frequent Patterns Mining Algorithm Based on MapReduce Model [J]. Computer Science, 2017, 44(7): 31-37.
[11] ZHU Kai-long, LU Yu-liang and YANG Bin. Study on Invulnerability of Router-level Internet Based on MapReduce [J]. Computer Science, 2017, 44(11): 168-174.
[12] HE Ming, WU Xiao-fei, CHANG Meng-meng and REN Wan-peng. Distributed Collaborative Filtering Recommendation Based on User Co-occurrence Matrix Multiplier [J]. Computer Science, 2016, 43(Z11): 428-435.
[13] YIN Xiao-bo and LUO En. Relaxed Optimal Balanced Streaming Graph Partitioning Algorithm [J]. Computer Science, 2016, 43(4): 231-234.
[14] SUN Yan-chao and WANG Xing-fen. MapReduce Designed to Optimize Computing Model Based on Hadoop Framework [J]. Computer Science, 2014, 41(Z11): 333-336.
[15] QIN Gao-de and WEN Gao-jin. Hierarchical Scheduling of Large Scale Distributed Computation [J]. Computer Science, 2013, 40(4): 91-95.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!