计算机科学 ›› 2015, Vol. 42 ›› Issue (5): 204-210.doi: 10.11896/j.issn.1002-137X.2015.05.041

• 软件与数据库技术 • 上一篇    下一篇

基于数据流的k-近邻连接算法

王 飞,秦小麟,刘 亮,沈 尧   

  1. 南京航空航天大学计算机科学与技术学院 南京210016,南京航空航天大学计算机科学与技术学院 南京210016,南京航空航天大学计算机科学与技术学院 南京210016,南京航空航天大学计算机科学与技术学院 南京210016
  • 出版日期:2018-11-14 发布日期:2018-11-14
  • 基金资助:
    本文受国家自然科学基金项目(61373015,61300052),国家教育部高等学校博士学科点专项科研基金资助

Algorithm for k-Nearest Neighbor Join Based on Data Stream

WANG Fei, QIN Xiao-lin, LIU Liang and SHEN Yao   

  • Online:2018-11-14 Published:2018-11-14

摘要: k-近邻连接查询是空间数据库中一种常用的操作,该查询处理过程涉及连接和最近邻查询两个复杂操作。传统的集中式k-近邻连接查询算法已不能适应当前呈爆炸式增长的数据规模,设计分布式k-近邻连接查询算法成为了目前亟需解决的问题。现有的分布式k-近邻连接查询算法都包括了多轮串行的MapReduce任务,而每个MapReduce任务均需要读写分布式文件系统,导致MapReduce不能有效表达多个任务之间的依赖关系,因此算法效率低下。首先提出了一种基于数据流的计算框架,该框架建立在MapReduce之上,将数据处理过程按照数据流图建模。在该框架基础上,提出了一种高效的k-近邻连接算法,它利用空间填充曲线将多维数据映射为一维数据,从而将k-近邻连接查询转化为一维范围查询。实验结果表明,该算法的可扩展性较高,且效率比现有算法更优。

关键词: k-近邻连接,数据流,MapReduce,计算框架

Abstract: kNN join is a frequently used operation in spatial database.It involves both the join and the NN search.Data scale is exploding,and traditional centralized algorithm can not meet the requirements.It is an urgent problem to design distributed kNN join algorithm currently.Existing distributed algorithms include several rounds of serial MapReduce tasks,but each MapReduce task reads and writes data from distributed file system.It is inefficient when expressing dependencies between jobs,and these algorithms are inefficient.Firstly,we proposed a framework based on data stream on MapReduce.This framework models data handle process according to the data flow diagram,and we proposed an efficient kNN join algorithm on the framework.The algorithm maps multi-dimensional data sets into one dimension using space-filling curves (z-values),and transforms kNN joins into a sequence of one-dimensional range searches.Experimental results demonstrate that the algorithm can efficiently resolve the large scale kNN join spatial query.

Key words: kNN join,Data stream,MapReduce,Framework

[1] Bhm C,Krebs F.The k-nearest neighbour join:Turbo charging the KDD process[J].Knowledge and Information Systems,2004,6(6):728-749
[2] Kavraki L E,Plaku E.Distributed Computation of the knnGraph for Large High-Dimensional Point Sets[J].Parallel Distributed Computation,2007,7(3):346-359
[3] Sardana D,Bhatnagar R.Graph Clustering Using Mutual K-Nearest Neighbors[M]∥Active Media Technology.Springer International Publishing,2014:35-48
[4] Xia C,Lu H,Ooi B C,et al.Gorder:an efficient method for KNN join processing[C]∥Proceedings of the Thirtieth international conference on VLDB Endowment.2004:756-767
[5] Yu C,Cui B,Wang S,et al.Efficient index-based KNN join processing for high-dimensional data[J].Information and Software Technology,2007,49(4):332-344
[6] Cheung K L,Fu A W C.Enhanced nearest neighbour search on the R-tree[J].ACM SIGMOD Record,1998,27(3):16-21
[7] Dean J,Ghemawat S.MapReduce:simplified data processing on large clusters[J].Communications of the ACM,2008,51(1):107-113
[8] Zhang C,Li F,Jestes J.Efficient parallel kNN joins for large data in MapReduce[C]∥Proceedings of the 15th International Conference on Extending Database Technology.ACM,2012:38-49
[9] Lu W,Shen Y,Chen S,et al.Efficient processing of k nearest neighbor joins using mapreduce[J].Proceedings of the VLDB Endowment,2012,5(10):1016-1027
[10] 刘义,景宁,陈荦,等.MapReduce 框架下基于 R-树的 k-近邻连接算法[J].软件学报,2013,4(8):1836-1851
[11] Apache.Hadoop[EB/OL].2014-4-10[2014-4-22].http://ha-doop.apache.org/
[12] Liu Y,Cui J,Huang Z,et al.SK-LSH:An Efficient Index Structure for Approximate Nearest Neighbor Search[J].Proceedings of the VLDB Endowment,2014,7(9):745-756
[13] Choi W,Oh S.Fast nearest neighbor search algorithm using the cache technique[J].Advanced Robotics,2013,27(15):1175-1187
[14] Gieseke F,Heinermann J,Oancea C,et al.Buffer kd trees:processing massive nearest neighbor queries on GPUs[C]∥Proceedings of The 31st International Conference on Machine Learning.2014:172-180
[15] Luo G,Naughton J F,Ellmann C J.A non-blocking parallel spatial join algorithm[C]∥ Proceedings of the 2002 IEEE 18th International Conference on Data Engineering.2002:697-705
[16] 何洪辉,王丽珍,周丽华.pgi-distance:一种高效的并行KNN-join处理方法[J].计算机研究与发展,2007,4(10):1774-1781
[17] Mutenda L,Kitsuregawa M.Parallel r-tree spatial join for ashared-nothing architecture[C]∥Proceedings of the 1999 IEEE International Symposium on Database Applications in Non-Traditional Environments.1999:423-430
[18] Yu C,Cui B,Wang S,et al.Efficient index-based KNN join processing for high-dimensional data[J].Information and Software Technology,2007,49(4):332-344
[19] Yao B,Li F,Kumar P.K nearest neighbor queries andknn-joins in large relational databases (almost) for free[C]∥ Proceedings of the 2010 IEEE 26th International Conference on Data Engineering.2010:4-15

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!