Computer Science, 2020, Vol. 47, Issue (11): 122-127. doi: 10.11896/jsjkx.190800093

• Database & Big Data & Data Science •

Study on Dynamic Adaptive Caching Strategy for Streaming Data Processing

WANG Xu-liang1, NIE Tie-zheng1, TANG Xin-ran2, HUANG Ju1, LI Di1, YAN Ming-sen1, LIU Chang1

  1 School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
  2 College of Software, Northeastern University, Shenyang 110169, China
  • Received: 2019-08-16 Revised: 2019-12-17 Online: 2020-11-15 Published: 2020-11-05
  • Corresponding author: NIE Tie-zheng (nietiezheng@mail.neu.edu.cn)
  • Author e-mail: fracturesr@foxmail.com
  • About author: WANG Xu-liang, born in 1998, bachelor. His main research interests include data mining and databases.
    NIE Tie-zheng, born in 1980, Ph.D., associate professor, master supervisor, is a member of China Computer Federation. His research interests include databases, data integration and blockchain.

Abstract: Streaming data processing is widely used in modern big data applications, where message middleware or message queues typically act as the data buffer. Apache Kafka is often used as this buffering middleware, and its performance largely determines the overall performance of the application system. In practice, the traffic produced by Kafka's upstream data sources is usually unstable, and a static caching strategy cannot adapt to such a variable production environment. A strategy that dynamically adjusts the data cache according to changes in upstream traffic would improve the system's adaptability to its environment and raise both the real-time performance and the throughput of streaming data caching. The proposed dynamic caching strategy monitors the upstream data traffic, uses an ARIMA model to predict future traffic, and adjusts the store-and-forward settings for streaming data in advance. The optimal values of the cache settings are obtained by multi-objective optimization over experimental measurements of middleware performance under various loads. Comparative experiments show that, during peak arrival periods of streaming data, the strategy improves the data buffering throughput of Apache Kafka by more than 150% while guaranteeing a given maximum delay, thereby improving the overall performance of the system.

Key words: Apache Kafka, Message middleware, Multi-objective optimization, Streaming data processing, Time series forecast
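The control loop summarized in the abstract (forecast the upstream traffic, then switch to the buffer setting that benchmarks found best for that load) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the forecaster is a hand-rolled ARIMA(1,1,0) without a constant term, fitted by least squares, standing in for whatever ARIMA order the authors used; the tuned parameters are assumed to resemble Kafka's `batch.size` and `linger.ms` producer settings; the benchmark numbers are placeholders; and the multi-objective step is reduced to a simple Pareto filter followed by a latency bound.

```python
from typing import Dict, List, NamedTuple, Tuple


class Candidate(NamedTuple):
    """One benchmarked buffer setting (all values used below are placeholders)."""
    batch_size: int        # bytes; assumed to map to Kafka's batch.size
    linger_ms: int         # assumed to map to Kafka's linger.ms
    throughput: float      # measured messages/s at a given load level
    max_latency_ms: float  # measured worst-case delay at that load level


def forecast_next(series: List[float]) -> float:
    """One-step forecast with ARIMA(1,1,0), no constant, fitted by least squares."""
    if len(series) < 3:
        raise ValueError("need at least three observations")
    diffs = [b - a for a, b in zip(series, series[1:])]  # d = 1 differencing
    num = sum(x * y for x, y in zip(diffs, diffs[1:]))   # sum of d[t-1] * d[t]
    den = sum(x * x for x in diffs[:-1])                 # sum of d[t-1]^2
    phi = num / den if den else 0.0                      # AR(1) coefficient
    return series[-1] + phi * diffs[-1]                  # undo the differencing


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep candidates not dominated in (higher throughput, lower max latency)."""
    def dominates(a: Candidate, b: Candidate) -> bool:
        no_worse = (a.throughput >= b.throughput
                    and a.max_latency_ms <= b.max_latency_ms)
        better = (a.throughput > b.throughput
                  or a.max_latency_ms < b.max_latency_ms)
        return no_worse and better
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def choose_setting(cands: List[Candidate], latency_bound_ms: float) -> Candidate:
    """Highest-throughput Pareto-optimal candidate that respects the bound."""
    feasible = [c for c in pareto_front(cands)
                if c.max_latency_ms <= latency_bound_ms]
    if not feasible:
        raise ValueError("no candidate meets the latency bound")
    return max(feasible, key=lambda c: c.throughput)


# Hypothetical benchmark table: load level (messages/s) -> measured candidates.
BENCHMARKS: Dict[int, List[Candidate]] = {
    1000: [Candidate(16384, 0, 900.0, 5.0),
           Candidate(65536, 5, 1000.0, 12.0)],
    5000: [Candidate(16384, 0, 2000.0, 40.0),
           Candidate(65536, 5, 4800.0, 30.0),
           Candidate(262144, 20, 5000.0, 80.0)],
}


def plan(history: List[float], latency_bound_ms: float) -> Tuple[float, Candidate]:
    """Forecast the next traffic value and pick the setting for the nearest load."""
    predicted = forecast_next(history)
    level = min(BENCHMARKS, key=lambda lv: abs(lv - predicted))
    return predicted, choose_setting(BENCHMARKS[level], latency_bound_ms)


predicted, setting = plan([1000, 2000, 3000, 4000], latency_bound_ms=50.0)
```

Selecting the setting from the forecast rather than from the current measurement is what lets the buffer be reconfigured before a traffic peak arrives. In a real deployment the forecaster would more likely come from a statistics library, and the benchmark table would be filled from load tests on the target cluster.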

CLC Number:

  • TP311