计算机科学 ›› 2023, Vol. 50 ›› Issue (1): 1-8.doi: 10.11896/jsjkx.211000149

• 数据库&大数据&数据科学 • 上一篇    下一篇

学习索引研究综述

王艺潭1, 王一舒1, 袁野2   

  1. 1 东北大学计算机科学与工程学院 沈阳 110169
    2 北京理工大学计算机学院 北京 100081
  • 收稿日期:2021-10-20 修回日期:2022-06-07 出版日期:2023-01-15 发布日期:2023-01-09
  • 通讯作者: 袁野(yuan-ye@bit.edu.cn)
  • 作者简介:2001826@stu.neu.edu.cn
  • 基金资助:
    国家重点研发计划(2022YFB2702100);国家自然科学基金(61932004,62225203,U21A20516)

Survey of Learned Index

WANG Yitan1, WANG Yishu1, YUAN Ye2   

  1. 1 School of Computer Science and Engineering,Northeastern University,Shenyang 110169,China
    2 School of Computer Science and Technology,Beijing Institute of Technology,Beijing 100081,China
  • Received:2021-10-20 Revised:2022-06-07 Online:2023-01-15 Published:2023-01-09
  • About author:WANG Yitan,born in 1997,postgra-duate.Her main research interests include data structure and temporal graph.
    YUAN Ye,born in 1981,Ph.D,professor,Ph.D supervisor.His main research interests include graph databases,pro-babilistic databases,data privacy-preserving,and cloud computing.
  • Supported by:
    National Key Research and Development Program of China(2022YFB2702100) and National Natural Science Foundation of China(61932004,62225203,U21A20516).

摘要: 大数据时代数据呈爆发式增长,传统索引结构难以处理庞大复杂的数据,为解决这一问题,学习索引应运而生,并成为当前数据库领域的研究热点之一。学习索引利用机器学习模型进行索引构建,通过对数据和物理位置之间的关系进行训练和学习得到学习模型,掌握二者之间的分布特点和规律,从而实现对传统索引的改进和优化。大量实验表明,与传统索引相比,学习索引可以适应大规模数据集,提供更好的搜索性能,具有更低的空间要求。文中详细介绍了学习索引的应用背景,梳理了现有的学习索引模型;根据数据类型的不同,将学习索引分为一维和多维两种类别,并对每种类别中学习索引模型的优缺点和可以支持的查询进行了详细的介绍和分析;最后对学习索引的未来研究方向进行了展望,以期为相关研究提供参考。

关键词: 学习索引, 机器学习, 索引构建, 数据结构, 数据库

Abstract: Due to the explosive growth of data in the era of big data,it is difficult for the traditional index structures to handle this huge and complex data.In order to solve this problem,the learned index has emerged and become one of the most popular research topics in the database.Learned indexes employ machine learning models for index construction.By training and learning the relationship between data and physical location,the learning model can be obtained so as to master the distribution characte-ristics between the two to realize the improvement and optimization of the traditional index.Extensive experiments show that learned indexes can adapt to large-scale data sets,and provide better search performance with lower memory requirements than traditional indexes.This paper introduces the applications of learned indexes and reviews the existing learned index models.According to data types,learned indexes are divided into two categories:one-dimensional and multi-dimensional.The advantages,disadvantages,and supported searches of learned index models in each category are also introduced and analyzed in detail.Finally,some future research directions of learned indexes are prospected to provide references for related researches.

Key words: Learned index, Machine learning, Index construction, Data structure, Database

中图分类号: 

  • TP311
[1]PAGH R,RODLER F F.Cuckoo Hashing[J].Journal of Algorithms,2004,51(2):122-144.
[2]ATHANASSOULIS M,AILAMAKI A.BF-Tree:ApproximateTree Indexing[C]//International Conference on Very Large Databases.2014:1881-1892.
[3]WITTEN I H,MOFFAT A,BELL T C.Managing Gigabytes:Compressing and Indexing Documents and Images[J].SIGMOD Record,1999,33(2):113-114.
[4]AREF W G,BARBARÁ D,VALLABHANENI P,et al.Thehandwritten trie:indexing electronic ink[C]//ACM Sigmod International Conference on Management of Data.1995:151-162.
[5]WANG J,LIN C,PAPAKONSTANTINOU Y,et al.An Expe-rimental Study of Bitmap Compression vs.Inverted List Compression[C]//the 2017 ACM International Conference.2017:993-1008.
[6]BENDER M A,DEMAINE E D,FARACH-COLTON M.Cache-oblivious B-trees[J].Siam Journal on Computing,2000,35(2):300-409.
[7]JENSEN C S,LIN D,OOI B C.Query and update efficient B+-tree based indexing of moving objects[C]//Proceedings of the Thirtieth International Conference on Very Large Data bases-Volume 30.2004:768-779.
[8]O'NEIL P,CHENG E,GAWLICK D,et al.The log-structured merge-tree(LSM-tree)[J].Acta Informatica,1996,33(4):351-385.
[9]KRASKA T,BEUTEL A,CHI E H,et al.The Case for Lear-ned Index Structures[C]//Proceedings of the 2018 International Conference on Management of Data.2018:489-504.
[10]GALAKATOS A,MARKOVITCH M,BINNIG C,et al.FI-Ting-Tree:A Data-aware Index Structure[C]//Proceedings of the 2019 International Conference on Management of Data.2019:1189-1206.
[11]LI P,HUA Y,ZUO P,et al.A Scalable Learned Index Scheme in Storage Systems[J].arXiv:2102.06789,2019.
[12]LI P,LU H,ZHENG Q,et al.LISA:A Learned Index Structure for Spatial Data[C]//Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data.2020:2119-2133.
[13]HIGUCHI S,TAKEMASA J,KOIZUMI Y,et al.Feasibility of longest prefix matching using learned index structures[J].ACM SIGMETRICS Performance Evaluation Review,2021,48(4):45-48.
[14]RAMADHAN H,KWON J.Enhancing Learned Index for aHigher Recall Trajectory K-Nearest Neighbor Search[C]//2021 IEEE International Conference on Big Data(Big Data).2021:6006-6007.
[15]KIPF A,MARCUS R,VAN RENEN A,et al.RadixSpline:asingle-pass learned index[C]//Proceedings of the Third International Workshop on Exploiting Artificial Intelligence Techniques for Data Management.2020:1-5.
[16]STOIAN M,KIPF A,MARCUS R,et al.PLEX:Towards Practical Learned Indexing[J].arXiv:2108.05117,2021.
[17]KRASKA T,ALIZADEH M.SageDB:A Learned Database System[C]//9th Biennial Conference on Innovative Data Systems Research.2019:13-16.
[18]WANG H X,FU X Y,XU J L,et al.:Learned Index for Spatial Queries[C]//2019 20th International Conference on Mobile Data Management.2019:569-574.
[19]DAVITKOVA A,MILCHEVSKI E,MICHEL S.The ML-Index:A Multidimensional,Learned Index for Point,Range,and Nearest-Neighbor Queries[C]//EDBT.2020:407-410.
[20]NATHAN V,DING J,ALIZADEH M,et al.Learning Multi-dimensional Indexes[C]//Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data.2020:985-1000.
[21]DING J,NATHAN V,ALIZADEH M,et al.Tsunami:A Lear-ned Multi-dimensional Index for Correlated Data and Skewed Workloads[J].Proceedings of the VLDB Endowment,2020,14(2):74-86.
[22]ZHANG S,RAY S,LU R,et al.Spatial Interpolation-based Learned Index for Range and kNN Queries[J].arXiv:2102.06789,2012.
[23]FOLMER M,NEUMANN R,KARUNANITHI T.A LearnedBucket Index Supporting Spatial Queries[D].Aalborg:Aalborg University,2019.
[24]DING J,MINHAS U F,YU J,et al.ALEX:An UpdatableAdaptive Learned Index[C]//International Conference on Ma-nagement of Data Conference.2020:969-984.
[25]WU J,ZHANG Y,CHEN S,et al.Updatable Learned Indexwith Precise Positions[J].Proceedings of the VLDB Endowment,2021,14(8):1276-1288.
[26]WU Y J,YU J,TIAN Y Y,et al.:Designing Succinct Secondary Indexing Mechanism by Exploiting Column Correlations[C]//Proceedings of the 2019 International Conference on Management of Data.2019:1223-1240.
[27]ZHANG J,GAO Y.CARMI:A Cache-Aware Learned Indexwith a Cost-based Construction Algorithm[J].arXiv:2103.00858,2021.
[28]TANG C,WANG Y,DONG Z,et al.XIndex:A Scalable Lear-ned Index for Multicore Data Storage[C]//Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.2020:308-320.
[29]FERRAGINA P,VINCIGUERRA G.The PGM-index:a fully-dynamic compressed learned index with provable worst-case bounds[J].Proceedings of the VLDB Endowment,2020,13(8):1162-1175.
[30]WANG Y,TANG C,WANG Z,et al.SIndex:a scalable learned index for string keys[C]//Proceedings of the 11th ACM SIGOPS Asia-Pacific Workshop on Systems.2020:17-24.
[31]MCKENNEY P,KLEEN A,KRIEGER O,et al.Read-Copy Update[C]//Ottawa Linux Symposium.2001:175-184.
[32]BRONSON N G,CASPER J,CHAFI H,et al.A practical concurrent binary search tree[J].ACM Sigplan Notices,2010,45(5):257-268.
[33]SANG K C,HWANG S,KIM K,et al.Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems[C]//Proceedings of 27th International Conference on Very Large Data Bases.2001:181-190.
[34]WU X,NI F,JIANG S,et al.Wormhole:A Fast Ordered Index for In-memory Data Management[C]//Proceedings of the Fourteenth EuroSys Conference.2019:1-16.
[35]MAO Y,KOHLER E,MORRIS R.Cache Craftiness for FastMulticore Key-Value Storage[C]//Proceedings of the 7th ACM European Confe-rence on Computer Systems.2012:183-196.
[36]ILYAS I F,MARKL V,HAAS P,et al.CORDS:Automatic discovery of correlations and soft functional dependencies[C]//Proceedings of the 2004 ACM SIGMOD International Confe-rence on Management of Data.2004:647-658.
[37]KIMURA H,HUO G,RASIN A,et al.Correlation maps:acompressed access method for exploiting soft functional depen-dencies[J].Proceedings of the VLDB Endowment,2009,2(1):1222-1233.
[38]LIU X,LIN Z,WANG H.Novel Online Methods for Time Series Segmentation[J].IEEE Transactions on Knowledge and Data Engineering,2008,20(12):1616-1626.
[39]XU Z,ZHANG R,KOTAGIRI R,et al.An adaptive algorithm for online time series segmentation with error bound guarantee[C]//Proceedings of the 15th International Conference on Extending Database Technology.2012:192-203.
[40]O'ROURKE J.An on-line algorithm for fitting straight lines between data ranges[J].Communications of the ACM,1981,24(9):574-578.
[41]OKANOHARA D,SADAKANE K.Practical entropy-com-pressed rank/select dictionary[C]//2007 Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments(ALENEX).2007:60-70.
[42]MOFFAT A,TURPIN A.Compression and coding algorithms[M].Berlin:Springer Science & Business Media,2002.
[43]NEUMANN T,MICHEL S.Smooth interpolating histogramswith error guarantees[C]//British National Conference on Databases.2008:126-138.
[44]MARCUS R,KIPF A,VAN RENEN A,et al.Benchmarking learned indexes[J].arXiv:2006.12804,2020.
[45]CROTTY A.Hist-Tree:Those Who Ignore It Are Doomed to Learn[C]//CIDR.2021:11-15.
[46]RAMSAK F,MARKL V,FENK R,et al.Integrating the UB-Tree into a Database System Kernel[C]//VLDB.2000:263-272.
[47]JAGADISH H V,OOI B C,TAN K L,et al.iDistance:An adaptive B+-tree based indexing method for nearest neighbor search[J].ACM Transactions on Database Systems(TODS),2005,30(2):364-397.
[48]SCHUH M A,WYLIE T,LIU C,et al.Approximating High-Dimensional Range Queries with kNN Indexing Techniques[C]//Berlin:Springer International Publishing.2014:369-380.
[49]BREIMAN L.Random forests[J].Machine Learning,2001,45(1):5-32.
[50]GAEDE V,GÜNTHER O.Multidimensional access methods[J].ACM Computing Surveys(CSUR),1998,30(2):170-231.
[51]NIEVERGELT J,HINTERBERGER H,SEVCIK K C.Thegrid file:An adaptable,symmetric multikey file structure[J].ACM Transactions on Database Systems(TODS),1984,9(1):38-71.
[52]MITAS L,MITASOVA H.Spatial interpolation[J].Geo-graphical Information Systems:Principles,Techniques,Management and Applications,1999,1(2):481-492.
[53]GUTTMAN A.R-trees:A dynamic index structure for spatial searching[C]//Proceedings of the 1984 ACM SIGMOD International Conference on Management of data.1984:47-57.
[54]ROYDEN H L,FITZPATRICK P.Real analysis[M].New York:Macmillan,1988.
[55]GARCIA E,GUPTA M.Lattice regression[J].Advances in Neural Information Processing Systems,2009,22:594-602.
[1] 徐夏, 张晖, 杨春明, 李波, 赵旭剑.
公平谱聚类方法用于提高簇的公平性
Fair Method for Spectral Clustering to Improve Intra-cluster Fairness
计算机科学, 2023, 50(2): 158-165. https://doi.org/10.11896/jsjkx.211100279
[2] 陈得鹏, 刘肖, 崔杰, 何道敬.
面向机器学习的成员推理攻击综述
Survey of Membership Inference Attacks for Machine Learning
计算机科学, 2023, 50(1): 302-317. https://doi.org/10.11896/jsjkx.220800227
[3] 冷典典, 杜鹏, 陈建廷, 向阳.
面向自动化集装箱码头的AGV行驶时间估计
Automated Container Terminal Oriented Travel Time Estimation of AGV
计算机科学, 2022, 49(9): 208-214. https://doi.org/10.11896/jsjkx.210700028
[4] 宁晗阳, 马苗, 杨波, 刘士昌.
密码学智能化研究进展与分析
Research Progress and Analysis on Intelligent Cryptology
计算机科学, 2022, 49(9): 288-296. https://doi.org/10.11896/jsjkx.220300053
[5] 何强, 尹震宇, 黄敏, 王兴伟, 王源田, 崔硕, 赵勇.
基于大数据的进化网络影响力分析研究综述
Survey of Influence Analysis of Evolutionary Network Based on Big Data
计算机科学, 2022, 49(8): 1-11. https://doi.org/10.11896/jsjkx.210700240
[6] 王润安, 邹兆年.
基于物理操作级模型的查询执行时间预测方法
Query Performance Prediction Based on Physical Operation-level Models
计算机科学, 2022, 49(8): 49-55. https://doi.org/10.11896/jsjkx.210700074
[7] 李瑶, 李涛, 李埼钒, 梁家瑞, Ibegbu Nnamdi JULIAN, 陈俊杰, 郭浩.
基于多尺度的稀疏脑功能超网络构建及多特征融合分类研究
Construction and Multi-feature Fusion Classification Research Based on Multi-scale Sparse Brain Functional Hyper-network
计算机科学, 2022, 49(8): 257-266. https://doi.org/10.11896/jsjkx.210600094
[8] 张光华, 高天娇, 陈振国, 于乃文.
基于N-Gram静态分析技术的恶意软件分类研究
Study on Malware Classification Based on N-Gram Static Analysis Technology
计算机科学, 2022, 49(8): 336-343. https://doi.org/10.11896/jsjkx.210900203
[9] 陈明鑫, 张钧波, 李天瑞.
联邦学习攻防研究综述
Survey on Attacks and Defenses in Federated Learning
计算机科学, 2022, 49(7): 310-323. https://doi.org/10.11896/jsjkx.211000079
[10] 李亚茹, 张宇来, 王佳晨.
面向超参数估计的贝叶斯优化方法综述
Survey on Bayesian Optimization Methods for Hyper-parameter Tuning
计算机科学, 2022, 49(6A): 86-92. https://doi.org/10.11896/jsjkx.210300208
[11] 赵璐, 袁立明, 郝琨.
多示例学习算法综述
Review of Multi-instance Learning Algorithms
计算机科学, 2022, 49(6A): 93-99. https://doi.org/10.11896/jsjkx.210500047
[12] 肖治鸿, 韩晔彤, 邹永攀.
基于多源数据和逻辑推理的行为识别技术研究
Study on Activity Recognition Based on Multi-source Data and Logical Reasoning
计算机科学, 2022, 49(6A): 397-406. https://doi.org/10.11896/jsjkx.210300270
[13] 姚烨, 朱怡安, 钱亮, 贾耀, 张黎翔, 刘瑞亮.
一种基于异质模型融合的 Android 终端恶意软件检测方法
Android Malware Detection Method Based on Heterogeneous Model Fusion
计算机科学, 2022, 49(6A): 508-515. https://doi.org/10.11896/jsjkx.210700103
[14] 王飞, 黄涛, 杨晔.
基于Stacking多模型融合的IGBT器件寿命的机器学习预测算法研究
Study on Machine Learning Algorithms for Life Prediction of IGBT Devices Based on Stacking Multi-model Fusion
计算机科学, 2022, 49(6A): 784-789. https://doi.org/10.11896/jsjkx.210400030
[15] 许杰, 祝玉坤, 邢春晓.
机器学习在金融资产定价中的应用研究综述
Application of Machine Learning in Financial Asset Pricing:A Review
计算机科学, 2022, 49(6): 276-286. https://doi.org/10.11896/jsjkx.210900127
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!