Computer Science (计算机科学), 2023, Vol. 50, Issue (1): 25-33. doi: 10.11896/jsjkx.220900045

• Database & Big Data & Data Science •

Fast Storage System for Time-series Big Data Streams Based on Waterwheel Model

LU Mingchen, LYU Yanqi, LIU Ruicheng, JIN Peiquan

  1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China
  • Received: 2022-09-05 Revised: 2022-10-28 Online: 2023-01-15 Published: 2023-01-09
  • Corresponding author: JIN Peiquan (jpq@ustc.edu.cn)
  • About author: lmc123@mail.ustc.edu.cn
    LU Mingchen, born in 1997, master. His main research interests include LSM-trees, among other topics.
    JIN Peiquan, born in 1975, Ph.D., associate professor, is a senior member of China Computer Federation. His main research interests include databases and big data.
  • Supported by:
    National Natural Science Foundation of China (62072419).

Abstract: With the rapid development of the Internet of Things, sensor deployments have grown steadily in scale in recent years. Large-scale sensor deployments generate massive data streams every second, and the value of the data decreases over time. A storage system must therefore withstand the write pressure of high-speed arriving data streams and persist the data as quickly as possible for subsequent query and analysis, which places higher demands on its write performance. A fast storage system based on the waterwheel model can meet the fast storage requirements of high-speed time-series data streams in big data application scenarios. The proposed system is deployed between the high-speed time-series data streams and the underlying storage nodes. It uses multiple data buckets to build a logically rotating storage model (similar to the ancient Chinese waterwheel) and coordinates data writing and persistence by controlling the state of each data bucket. The waterwheel model assigns data buckets to different underlying storage nodes, so that the instantaneous write pressure is spread evenly across multiple nodes and write throughput is improved through multi-node parallel writes. The waterwheel model is deployed on stand-alone MongoDB and compared experimentally with distributed MongoDB. The results show that the waterwheel model effectively improves write throughput, reduces write latency, and has good horizontal scalability.

Key words: Time-series big data, Streaming data, Fast storage, Waterwheel model, Middleware
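
The rotating-bucket idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the bucket states, the bucket capacity, or how buckets are assigned to nodes, so the names `BucketState`, `Bucket`, `Waterwheel`, the `capacity` parameter, and the round-robin assignment are all assumptions made for illustration. Storage nodes are simulated as in-memory lists, and the flush runs synchronously, whereas the described system relies on parallel writes to multiple nodes.

```python
from dataclasses import dataclass, field
from enum import Enum


class BucketState(Enum):
    """Hypothetical bucket states; the abstract only says states are controlled."""
    EMPTY = "empty"        # ready to receive data
    FILLING = "filling"    # currently receiving incoming records
    FLUSHING = "flushing"  # full, being persisted to its storage node


@dataclass
class Bucket:
    node_id: int  # index of the storage node this bucket drains to
    state: BucketState = BucketState.EMPTY
    records: list = field(default_factory=list)


class Waterwheel:
    """Toy middleware sitting between a data stream and several storage nodes."""

    def __init__(self, num_buckets: int, num_nodes: int, capacity: int):
        if num_buckets < 2:
            raise ValueError("need at least two buckets to rotate")
        self.capacity = capacity
        # Assumption: buckets are assigned to nodes round-robin.
        self.buckets = [Bucket(node_id=i % num_nodes) for i in range(num_buckets)]
        # Each simulated storage node is just a list of persisted records.
        self.nodes = [[] for _ in range(num_nodes)]
        self.current = 0
        self.buckets[0].state = BucketState.FILLING

    def write(self, record) -> None:
        """Append a record to the filling bucket; rotate the wheel when it is full."""
        bucket = self.buckets[self.current]
        bucket.records.append(record)
        if len(bucket.records) >= self.capacity:
            self._rotate()

    def _rotate(self) -> None:
        full = self.buckets[self.current]
        full.state = BucketState.FLUSHING
        # Advance to the next bucket so new writes have somewhere to go.
        self.current = (self.current + 1) % len(self.buckets)
        nxt = self.buckets[self.current]
        if nxt.state is not BucketState.EMPTY:
            # A real system would block or apply back-pressure here.
            raise RuntimeError("next bucket is not ready")
        nxt.state = BucketState.FILLING
        self._flush(full)

    def _flush(self, bucket: Bucket) -> None:
        """Persist a full bucket to its node (synchronous stand-in for a parallel write)."""
        self.nodes[bucket.node_id].extend(bucket.records)
        bucket.records.clear()
        bucket.state = BucketState.EMPTY
```

For example, with `Waterwheel(num_buckets=4, num_nodes=2, capacity=3)`, consecutive full buckets drain to alternating nodes, which is the sense in which the instantaneous write load is spread across nodes in this sketch.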

CLC Number:

  • TP311