Computer Science ›› 2025, Vol. 52 ›› Issue (2): 33-41.doi: 10.11896/jsjkx.240200001

• Database & Big Data & Data Science • Previous Articles     Next Articles

Performance Optimization of LSM-tree Based Key-Value Storage System Based on Fine-grained Cache andLearned Index

XU Ruida1, LI Yongkun1,2, XU Yinlong1,2   

  1. 1 School of Computer Science and Technology,University of Science and Technology of China,Hefei 230026,China
    2 Anhui Province Key Laboratory of High Performance Computing,Hefei 230026,China
  • Received:2024-02-01 Revised:2024-06-24 Online:2025-02-15 Published:2025-02-17
  • About author:XU Ruida,born in 2000,postgraduate.His main research interests include key-value stores and file systems.
    LI Yongkun,born in 1986,professor,is a member of CCF(No.33184M).His main research interests include parallel and distributed systems,key-value systems,memory systems,and cloud computing,etc.
  • Supported by:
    National Natural Science Foundation of China(62172382).

Abstract: In the context of the big data era where the amount of data is growing rapidly,log-structured merge-tree-based(LSM-Tree-based) key-value storage systems are widely deployed in many NoSQL systems because of their excellent flexibility and scalability.However,when querying data in the traditional LSM-Tree based key-value storage system,the read amplification problem,which is caused by searching through multiple SSTables,generates additional I/O overhead.In this paper,we present FCLI-LSM,a new optimization design for key-value storage systems.FCLI-LSM improves the performance of LSM-Tree-based key-value storage by combining the optimization methods of fine-grained key-value pair caching and learned index.By analyzing data access hotspots.FCLI-LSM implements three-level classification of hot,warm,and cold key-value data.FCLI-LSM designs a fine-grained caching mechanism for hot data based on key-value separation,effectively reducing the additional I/O overhead caused by the read amplification problem.In addition,FCLI-LSM also designs a cache affinity optimization for learned index,which further improves the query performance of storage system for warm key-value data.Experimental results show that,compared with existing read optimization solutions,FCLI-LSM can reduce average read latency by more than 40% and increase system throughput by more than 1.7 times.

Key words: Big data, Key-Value storage, LSM tree, Learned index, Caching

CLC Number: 

  • TP311
[1]REINSEL D,GANTZ J,RYDNING J.Data age 2025:the digitization of the world from edge to core[J/OL].https://www.seagate.com/files/www-content/our-story/trends/files/idc-sea-gate-dataage-whitepaper.pdf.
[2]O'NEIL P,CHENG E,GAWLICK D,et al.The log-structured merge-tree(LSM-tree)[J].Acta Informatica,1996,33:351-385.
[3]GHEMAWAT S,DEAN J.LevelDB[EB/OL].https://git-hub.com/google/leveldb.
[4]FACEBOOK.RocksDB[EB/OL].https://github.com/facebook/rocksdb.
[5]LU L,PILLAI T S,GOPALAKRISHNAN H,et al.Wisckey:Separating keys from values in ssd-conscious storage[J].ACM Transactions on Storage(TOS),2017,13(1):1-28.
[6]KRASKA T,BEUTEL A,CHI E H,et al.The case for learned index structures[C]//Proceedings of the 2018 International Conference on Management of Data.2018:489-504.
[7]DAI Y,XU Y,GANESAN A,et al.From {WiscKey} to Bourbon:A Learned Index for {Log-Structured} Merge Trees[C]//14th USENIX Symposium on Operating Systems Design and Implementation(OSDI 20).2020:155-171.
[8]ZHANG J,WANG F,DONG C.HaLSM:A Hotspot-awareLSM-tree based Key-Value Storage Engine[C]//2022 IEEE 40th International Conference on Computer Design(ICCD).IEEE,2022:179-186.
[9]CAO Z,DONG S,VEMURI S,et al.Characterizing,modeling,and benchmarking {RocksDB}{Key-Value} workloads at facebook[C]//18th USENIX Conference on File and Storage Technologies(FAST 20).2020:209-223.
[10]ATIKOGLU B,XU Y,FRACHTENBERG E,et al.Workloadanalysis of a large-scale key-value store[C]//Proceedings of the 12th ACM Sigmetrics/Performance Joint International Confe-rence on Measurement and Modeling of Computer Systems.2012:53-64.
[11]COOPER B F,SILBERSTEIN A,TAM E,et al.Benchmarking cloud serving systems with YCSB[C]//Proceedings of the 1st ACM Symposium on Cloud Computing.2010:143-154.
[12]COMER D.Ubiquitous B-tree[J].ACM Computing Surveys,1979,11(2):121-137.
[13]SIVASUBRAMANIAN S.Amazon dynamoDB:a seamlesslyscalable non-relational database service[C]//Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data.2012:729-730.
[14]CHANG F,DEAN J,GHEMAWAT S,et al.Bigtable:A distri-buted storage system for structured data[J].ACM Transactions on Computer Systems(TOCS),2008,26(2):1-26.
[15]LAKSHMAN A,MALIK P.Cassandra:a decentralized struc-tured storage system[J].ACM SIGOPS Operating Systems Review,2010,44(2):35-40.
[16]PUGH W.Skip lists:a probabilistic alternative to balanced trees[J].Communications of the ACM,1990,33(6):668-676.
[17]DGRAPH.BadgerDB[EB/OL].https://github.com/dgraph-io/badger.
[18]PINGCAP.Titan[EB/OL].https://github.com/tikv/titan.
[19]GALAKATOS A,MARKOVITCH M,BINNIG C,et al.Fiting-tree:A data-aware index structure[C]//Proceedings of the 2019 International Conference on Management of Data.2019:1189-1206.
[20]DING J,MINHAS U F,YU J,et al.ALEX:an updatable adaptive learned index[C]//Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data.2020:969-984.
[21]XIE Q,PANG C,ZHOU X,et al.Maximum error-boundedpiecewise linear representation for online stream approximation[J].The VLDB Journal,2014,23:915-937.
[22]OPENSTREETMAP.OpenStreetMap[EB/OL].https://www.openstreetmap.org/.
[23]TENG D,GUO L,LEE R,et al.LSbM-tree:Re-enabling buffer caching in data management for mixed reads and writes[C]//2017 IEEE 37th International Conference on Distributed Computing Systems(ICDCS).IEEE,2017:68-79.
[24]WU F,YANG M H,ZHANG B,et al.{AC-Key}:Adaptive Ca-ching for {LSM-based}{Key-Value} Stores[C]//2020 USENIX Annual Technical Conference(USENIX ATC 20).2020:603-615.
[25]YANG L,WU H,ZHANG T,et al.Leaper:A learned prefe-tcher for cache invalidation in LSM-tree based storage engines[J].Proceedings of the VLDB Endowment,2020,13(12):1976-1989.
[26]GALAKATOS A,MARKOVITCH M,BINNIG C,et al.Fiting-tree:A data-aware index structure[C]//Proceedings of the 2019 International Conference on Management of Da Data.2019:1189-1206.
[27]FERRAGINA P,VINCIGUERRA G.The PGM-index:a fully-dynamic compressed learned index with provable worst-case bounds[J].Proceedings of the VLDB Endowment,2020,13(8):1162-1175.
[28]KIPF A,MARCUS R,VAN RENEN A,et al.RadixSpline:asingle-pass learned index[C]//Proceedings of the Third International Workshop on Exploiting Artificial Intelligence Techniques for Data Management.2020:1-5.
[29]DAYAN N,ATHANASSOULIS M,IDREOS S.Monkey:Optimal navigable key-value store[C]//Proceedings of the 2017 ACM International Conference on Management of Da Data.2017:79-94.
[30]LI Y,TIAN C,GUO F,et al.{ElasticBF}:Elastic Bloom Filter with Hotness Awareness for Boosting Read Performance in Large {Key-Value} Stores[C]//2019 USENIX Annual Technical Conference(USENIX ATC 19).2019:739-752.
[1] LIANG Zheheng, WU Yuewen, LI Yongjian , ZHANG Xiaolu , SHEN Guiquan, SU Lingang, LIU Junle. Resource Preference-sensitive Cloud Configuration Recommendation Method for Big DataApplications [J]. Computer Science, 2025, 52(6A): 240800114-9.
[2] FU Xiong, SONG Zhaoyang, WANG Junchang, DENG Song. Study on Distributed Hybrid Storage Based on Erasure Coding and Replication [J]. Computer Science, 2025, 52(2): 42-47.
[3] ZHANG Manjing, HE Yulin, LI Xu, HUANG Zhexue. Distributed Two-stage Clustering Method Based on Node Sampling [J]. Computer Science, 2025, 52(2): 134-144.
[4] LIU Wei, SUN Jia, WANG Peng, CHEN Yafan. Development on Methods and Applications of Cognitive Computing of Urban Big Data [J]. Computer Science, 2024, 51(7): 49-58.
[5] WANG Hancheng, DAI Haipeng, CHEN Zhipeng, CHEN Shusen, CHEN Guihai. Large-scale Network Community Detection Algorithm Based on MapReduce [J]. Computer Science, 2024, 51(4): 11-18.
[6] CHEN Pan, CHEN Hongmei, LUO Chuan. Academic Influence Ranking Algorithm Based on Topic Reputation and Dynamic HeterogeneousNetwork [J]. Computer Science, 2024, 51(3): 81-89.
[7] YAN Jiahe, LI Honghui, MA Ying, LIU Zhen, ZHANG Dalin, JIANG Zhouxian, DUAN Yuhang. Multi-source Heterogeneous Data Fusion Technologies and Government Big Data GovernanceSystem [J]. Computer Science, 2024, 51(2): 1-14.
[8] CHEN Shanshan, GAO Jun, MA Zhenyu. GDLIN:A Learned Index By Gradient Descent [J]. Computer Science, 2023, 50(6A): 220600256-6.
[9] FAN Shuhuan, HOU Mengshu. Dataspace:A New Data Organization and Management Model [J]. Computer Science, 2023, 50(5): 115-127.
[10] HU Xuegang, LI Yang, WANG Lei, LI Peipei, YOU Zhuhong. Key Technologies of Intelligent Identification of Biomarkers:Review of Research on Association Prediction Between Circular RNA and Disease [J]. Computer Science, 2023, 50(4): 369-387.
[11] JIANG Chuanyu, HAN Xiangyu, YANG Wenrui, LYU Bohan, HUANG Xiaoou, XIE Xia, GU Yang. Survey of Medical Knowledge Graph Research and Application [J]. Computer Science, 2023, 50(3): 83-93.
[12] MA Wensheng, HOU Xilin, WANG Hongbo, LIU Sen. Study on Value Calculation of Big Data Based on Granular Tree and Usage Relationship [J]. Computer Science, 2023, 50(11A): 230300109-8.
[13] WU Chun, CHEN Long, SUN Yifei, WU Jigang. Fairness-aware Service Caching and Task Offloading with Cooperative Mobile Edge Computing [J]. Computer Science, 2023, 50(11A): 230200095-8.
[14] ZHANG Junna, CHEN Jiawei, BAO Xiang, LIU Chunhong, YUAN Peiyan. Cost-minimizing Task Offload Strategy for Mobile Devices Under Service Cache Constraint [J]. Computer Science, 2023, 50(10): 275-281.
[15] WANG Yitan, WANG Yishu, YUAN Ye. Survey of Learned Index [J]. Computer Science, 2023, 50(1): 1-8.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!