计算机科学 ›› 2023, Vol. 50 ›› Issue (4): 32-39.doi: 10.11896/jsjkx.220600078

• 数据库&大数据&数据科学 • 上一篇    下一篇

LayerLSB:基于分层局部敏感B树的最近邻搜索

丁际文, 刘卓锦, 王家兴, 张岩峰, 于戈   

  1. 东北大学计算机科学与工程学院 沈阳 110167
  • 收稿日期:2022-06-08 修回日期:2022-10-19 出版日期:2023-04-15 发布日期:2023-04-06
  • 通讯作者: 张岩峰(zhangyf@mail.neu.edu.cn)
  • 作者简介:(djw@mail.neu.edu.cn)
  • 基金资助:
    国家自然科学基金(62072082);中央高校基本科研业务费(N2216015)

LayerLSB:Nearest Neighbors Search Based on Layered Locality Sensitive B-tree

DING Jiwen, LIU Zhuojin, WANG Jiaxing, ZHANG Yanfeng, YU Ge   

  1. School of Computer Science and Engineering,Northeastern University,Shenyang 110167,China
  • Received:2022-06-08 Revised:2022-10-19 Online:2023-04-15 Published:2023-04-06
  • About author:DING Jiwen,born in 1982,postgra-duate,is a member of China Computer Federation.His main research interests include big data analysis and database.
    ZHANG Yanfeng,born in 1982,professor,Ph.D supervisor,is a senior member of China Computer Federation.His main research interests include big data mining,large scale machine learning,and distributed systems.
  • Supported by:
    National Natural Science Foundation of China(62072082) and Fundamental Research Funds for the Central Universities(N2216015).

摘要: 最近邻搜索由于其广泛的应用已成为一个重要的研究课题。传统的空间索引结构,如R-tree和KD-tree,可以在低维空间中高效地返回准确的最近邻搜索结果,但不适用于高维空间。局部敏感B树(LSB)将数据点哈希到可排序的一维值,并将它们排列成树状结构,这在不影响结果质量的前提下极大地提高了传统局部敏感哈希(LSH)所需的空间和查询效率。但是,LSB并没有考虑到数据分布,它在均匀的数据分布设置中表现良好,但在数据倾斜时表现出了不稳定的性能。针对这个问题,文中提出了LayerLSB,通过探索哈希值的密度对密集范围内的哈希值进行重建,使其分布更均匀,从而提高查询效率。 相比LSB,LayerLSB索引在数据分布方面变得更有针对性,并构建了多层结构,与简单的重新哈希方法相比,多层方法会通过仔细选择组数和哈希函数来保证搜索质量。实验结果表明,在达到相同查询精度的情况下,查询成本最多可降低为原来的44.6%。

关键词: 最近邻搜索, 分层结构, 局部敏感哈希, 局部敏感B树

Abstract: Nearest neighbor search has become a significant research problem due to its wide applications.Traditional spatial index structures such as R-tree and KD-tree can efficiently return accurate nearest neighbor search results in low-dimensional space,but they are not suitable for high-dimensional space.Locality sensitive B-tree(LSB) hashes data points to the sortable one-dimension values and arranges them in a tree-like structure,which dramatically improves the space and query efficiency of the previous locality sensitive hashing(LSH) implementations,without compromising the resulting quality.However,LSB fails to take data distribution into account.It performs well in a uniform data distribution setting,but exhibits unstable performance when the data are skewed.In response to this problem,this paper proposes LayerLSB,which reconstructs the hash values in a dense range by exploring the density of the hash values to make the distribution more uniform,so as to improve the query efficiency.Compared to LSB,LayerLSB indices become more targeted in terms of data distribution,and a multi-layered structure is constructed.Compared with the simple rehashing method,the multi-layered approach will still guarantee the search quality by carefully choosing the number of groups and hash functions.The results show that the query cost can be reduced to 44.6% of the original at most when achieving the same query accuracy.

Key words: Nearest neighbor search, Layered structure, Locality sensitive hashing, Locality sensitive B-tree

中图分类号: 

  • TP319
[1]GUTTMAN A.R-trees:a dynamic index structure for spatialsearching[C]//Proceedings of Acm Sigmod International Conference on Management of Data.New York,1984:47-57.
[2]KATAYAMA N.The SR-tree:an index structure for high-dimensional nearest neighbor queries[J].ACM Sigmod Record,1997,26(2):369-380.
[3]BENTLEY J L.Multidimensional binary search trees used for associative searching[J].Communications of the ACM,1975,18(9):509-517.
[4]WEBERR.A Quantitative Analysis and Performance Study for Similarity-Search Methods in high-Dimensional Spaces[C]//Proceedings of the VLDB Conference.New York,1998:194-205.
[5]TAO Y,KE Y,CHENG S,et al.Quality and efficiency in high dimensional nearest neighbor search[C]//Proceedings of the ACM SIGMOD International Conference on Management of Data.Rhode Island,USA,2009:563-576.
[6]YAO B,LI F,KUMAR P.K nearest neighbor queries and kNN-Joins in large relational databases (almost) for free[C]//Proceedings of the 2010 IEEE 26th International Conference on Data Engineering.2010:1103-1114.
[7]CHI Z,LI F,JESTES J.Efficient parallel kNN joins for largedata in MapReduce[C]//Proceedings of the Extending Database Technology.ACM,2012:38-49.
[8]INDYK P.Approximate Nearest Neighbors:Towards Removing the Curse of Dimensionality[C]//Proceedings of the 30th ACM Symposium on Theory of Computing(STOC’98).1998.
[9]DATAR M.Locality-sensitive hashing scheme based on p-stable distributions[C]//Proceedings of the 20th ACM Symposium on Computational Geometry.2004:253-262.
[10]ZOLOTAREV V M.One-dimensional stable distributions[M].American Mathematical Society,1986.
[11]GAEDE V,GUNTHER O.Multidimensional Access Methods[J].ACM Computing Surveys,1998,30(2):170-231.
[12]DATAR M,IMMORLICA N,INDYK P,et al.Locality sensitive hashing scheme based on p-stable distributions[C]//Procee-dings of the Twentieth Annual Symposium on Computational Geometry(Scg ’04).2004:253-262.
[13]SHRIVASTAVA A,LI P.Asymmetric LSH(ALSH) for Sublinear Time Maximum Inner Product Search (MIPS)[C]//Proceedings of the MIT Press.MIT Press,2014:2321-2329.
[14]ANDONI A,INDYK P,LAARHOVEN T,et al.Practical and optimal LSH for angular distance[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 1.2015:1225-1233.
[15]LV Q,JOSEPHSON W,ZHE W,et al.Multi-probe LSH:efficient indexing for high-dimensional similarity search[C]//Proceedings of the International Conference on Very Large Data Bases.VLDB Endowment,2007:950-961.
[16]PANIGRAHY R.Entropy based Nearest Neighbor Search inHigh Dimensions[C]//Proceedings of the Seventeenth Annual {ACM-SIAM} Symposium on Discrete Algorithms.2005:1186-1195.
[17]LIU W,WANG H,ZHANG Y,et al.I-LSH:I/O Efficient c-Approximate Nearest Neighbor Search in High-Dimensional Space[C]//Proceedings of the 2019 IEEE 35th International Confe-rence on Data Engineering(ICDE).IEEE,2019:1670-1673.
[18]LIU W,WANG H,ZHANG Y,et al.EI-LSH:An early-termination driven I/O efficient incremental c-approximate nearest neighbor search[J].The VLDB Journal,2021,30(2):215-235.
[19]BAWA M,CONDIE T,GANESAN P.LSH forest:Self-tuningindexes for similarity search[C]//Proceedings of the 14th International Conference on World Wide Web.Chiba,Japan,2005:651-660.
[20]PAN J,MANOCHA D.Bi-level locality sensitive hashing for k-nearest neighbor computation[C]//2012 IEEE 28th International Conference on Data Engineering.IEEE,2012:378-389.
[21]SUN Y,WEI W,QIN J,et al.SRS:Solving c-Approximate Nearest Neighbor Queries in High Dimensional Euclidean Space with a Tiny Index[C]//Proceedings of the VLDB Endowment.VLDB Endowment,2014:1-12.
[22]LIU Z Y,CUI Z J,HUANG Z,et al.SK-LSH:An efficient index structure for approximate nearest neighbor search[J].Procee-dings of the VLDB Endowment,2014,7(9):745-756.
[23]ZHENG B,ZHAO X,WENG L,et al.PM-LSH:A fast and accurate LSH framework for high-dimensional approximate NN search[C]//Proceedings of the VLDB Endowment.2020:643-665.
[24]LU K,KUDO M.R2LSH:A Nearest Neighbor Search Scheme Based on Two-dimensional Projected Spaces[C]//Proceedings of the 2020 IEEE 36th International Conference on Data Engineering(ICDE).IEEE,2020:1045-1056.
[1] 郦睿翔, 毛莺池, 郝帅.
基于近似匹配的移动边缘计算缓存管理方法
Cache Management Method in Mobile Edge Computing Based on Approximate Matching
计算机科学, 2021, 48(1): 96-102. https://doi.org/10.11896/jsjkx.200800215
[2] 李红梅,郝文宁,陈刚.
基于改进LSH的协同过滤推荐算法
Collaborative Filtering Recommendation Algorithm Based on Improved Locality-sensitive Hashing
计算机科学, 2015, 42(10): 256-261.
[3] 许喆,陈福才,李邵梅,李星.
基于多探寻局部敏感哈希和单词映射链投票的图像检索方法
Image Retrieval Based on Multi-probe Locality Sensitive Hashing and Word Map Chain Voting
计算机科学, 2014, 41(5): 82-85. https://doi.org/10.11896/j.issn.1002-137X.2014.05.018
[4] 徐娟,李战怀,柯希林.
基于分层结构的前缀编码方案研究
Novel Prefix Encoding Scheme Based on Layered Structure
计算机科学, 2009, 36(7): 145-149. https://doi.org/10.11896/j.issn.1002-137X.2009.07.034
[5] 裘君 吴朝晖 徐昭.
基于OWL本体论映射的数据库网格语义模式集成研究

计算机科学, 2005, 32(5): 4-7.
[6] 王斌君 郝克刚.
工作流过程定义中的分层结构与正则Petri网

计算机科学, 2003, 30(11): 157-159.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!