Computer Science ›› 2023, Vol. 50 ›› Issue (6A): 220600256-6. doi: 10.11896/jsjkx.220600256

• Big Data & Data Science •

GDLIN: A Learned Index by Gradient Descent

陈珊珊, 高隽, 马振禹   

  1. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  • Online: 2023-06-10 Published: 2023-06-12
  • Corresponding author: CHEN Shanshan (chenss@njupt.edu.cn)
  • Supported by:
    General Program of the National Natural Science Foundation of China (61972209).

GDLIN: A Learned Index by Gradient Descent

CHEN Shanshan, GAO Jun, MA Zhenyu

  1. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  • Online: 2023-06-10 Published: 2023-06-12
  • About author: CHEN Shanshan, born in 1980, Ph.D, associate professor. Her main research interests are large-scale distributed storage systems and architectures.
  • Supported by:
    National Natural Science Foundation of China (61972209).

Abstract: In the era of big data, data access speed is an important indicator of the performance of large-scale storage systems, and indexing is one of the main techniques for improving data access performance in database systems. In recent years, the learned index (LI) has been proposed: it replaces traditional indexes such as the B+-tree with machine learning models that fit the distribution of the data, turning the indirect search for a record into a direct function computation. LI speeds up queries and reduces index space overhead, but its fitting error is relatively large and it does not support modification operations such as insertion. This paper proposes GDLIN (A Learned Index by Gradient Descent), a learned index model that fits the data with a gradient descent algorithm. GDLIN uses gradient descent to fit the data more closely, reducing the fitting error and shortening the time spent on local search; it also calls the data-fitting algorithm recursively to exploit the distribution of the keys and build the upper levels of the structure, so that the index does not grow with the volume of data. In addition, GDLIN uses a linked list to address LI's lack of support for data insertion. Experimental results show that, with no new data inserted, GDLIN achieves 2.1 times the throughput of the B+-tree, and with 50% insert operations it achieves 1.08 times the throughput of LI.
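To make the fitting and lookup steps above concrete, the following Python sketch illustrates the general idea under simple assumptions; it is not the paper's implementation, and all function and variable names are illustrative. It fits a single linear model position ≈ w·key + b by batch gradient descent over a sorted key array, records the worst-case prediction error, and answers a lookup with a bounded local search around the predicted position.

def fit_positions(keys, lr=0.5, epochs=2000):
    """Fit position ~ w*x + b by batch gradient descent on squared error.

    Keys and positions are rescaled to [0, 1] so one fixed learning rate
    works regardless of the key range; `keys` must be sorted.
    """
    n = len(keys)
    denom = (n - 1) or 1
    k0, span = keys[0], (keys[-1] - keys[0]) or 1
    xs = [(k - k0) / span for k in keys]      # normalized keys
    ys = [i / denom for i in range(n)]        # normalized positions
    w, b = 0.0, 0.0
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = (w * x + b) - y             # prediction error
            gw += 2.0 * err * x / n           # d(mean squared error)/dw
            gb += 2.0 * err / n               # d(mean squared error)/db
        w -= lr * gw
        b -= lr * gb

    def predict(key):
        x = (key - k0) / span
        pos = round((w * x + b) * denom)
        return min(max(pos, 0), n - 1)

    # worst-case gap between predicted and true position; it bounds
    # how far the local search has to look around a prediction
    max_err = max(abs(predict(k) - i) for i, k in enumerate(keys))
    return predict, max_err

def lookup(keys, key, predict, max_err):
    """Predict a position, then scan only the small window around it."""
    pos = predict(key)
    lo, hi = max(0, pos - max_err), min(len(keys) - 1, pos + max_err)
    for i in range(lo, hi + 1):
        if keys[i] == key:
            return i
    return None

When the model fits the key distribution well, max_err is small and a lookup touches only a few slots; per the abstract, GDLIN additionally applies the construction recursively so that a single model remains at the top of the structure.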

Key words: Learned index, Gradient descent, Fitting data model, Linked list

Abstract: In the era of big data, data access speed is an important indicator of the performance of large-scale storage systems, and indexing is one of the main technologies for improving data access performance in database systems. In recent years, the learned index (LI) has been proposed. It replaces traditional B+-tree indexes with machine learning models that are trained on patterns in the underlying data distribution, turning the indirect search of a data query into a direct function computation, which speeds up queries and reduces the size of the index. However, the fitting error of LI is relatively large, and because it assumes that the data is static and read-only, it does not support modification operations such as insertion. This paper presents GDLIN, a learned index that uses a gradient descent algorithm to fit the data. Gradient descent reduces the error between the predicted position and the actual position, which lowers the cost of the local search. In addition, GDLIN calls the construction algorithm recursively until only one model remains at the top, which makes full use of the keys' distribution and prevents the size of the index from growing with the data volume. GDLIN also uses a sorted linked list to address the problem of data insertion. Experimental results demonstrate that GDLIN improves lookup throughput by 2.1× compared with the traditional B+-tree when there are no insertions, and by 1.08× compared with LI when the insertion ratio is 0.5.
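The abstract states that GDLIN absorbs insertions with a sorted linked list so that the fitted model and its key array can stay read-only. The Python sketch below shows one plausible shape for that idea; it is a hypothetical illustration (class and method names are assumptions, not the paper's data structure), in which a lookup that misses the bounded search falls back to scanning this short chain.

class _ChainNode:
    """One entry of the sorted overflow chain."""
    __slots__ = ("key", "value", "next")

    def __init__(self, key, value, nxt=None):
        self.key, self.value, self.next = key, value, nxt

class InsertChain:
    """Sorted singly linked list holding keys inserted after the model was fitted."""

    def __init__(self):
        self.head = None

    def insert(self, key, value):
        # walk to the first node whose key is >= the new key
        # and splice the new node in front of it
        prev, cur = None, self.head
        while cur is not None and cur.key < key:
            prev, cur = cur, cur.next
        node = _ChainNode(key, value, cur)
        if prev is None:
            self.head = node
        else:
            prev.next = node

    def get(self, key):
        # the chain is sorted, so stop once keys grow past the target
        cur = self.head
        while cur is not None and cur.key <= key:
            if cur.key == key:
                return cur.value
            cur = cur.next
        return None

In a design like this the chain would be kept short, for example by merging it back into the key array and refitting the model once it exceeds a threshold; the paper's actual merge policy is not described in the abstract.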

Key words: Learned index, Gradient descent, Fitting data model, Linked list

CLC number: TP391