Computer Science ›› 2020, Vol. 47 ›› Issue (6A): 474-479. doi: 10.11896/jsjkx.190900046

• Database & Big Data & Data Science •

Research on HBase Configuration Parameter Optimization Based on Machine Learning

XU Jiang-feng and TAN Yu-long

  1. School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China
  • Published: 2020-07-07
  • Corresponding author: TAN Yu-long (YulongTan25@163.com)
  • About author: XU Jiang-feng, born in 1965, Ph.D., professor, is a member of China Computer Federation. His research interests include data encryption technology and network security technology.
    TAN Yu-long, born in 1994, postgraduate, is a member of China Computer Federation. His research interests include information security, network security technology, and machine learning.
  • Supported by:
    This work was supported by the Fundamental Research Funds for the Central Universities (20190605).

Abstract: HBase is a distributed database management system that is becoming increasingly popular for applications requiring fast random access to large amounts of data. However, it has many performance-critical configuration parameters that can interact with each other in complex ways, which makes manually tuning them for optimal performance extremely difficult. This paper proposes a new method, called auto-tuning HBase, that automatically tunes the configuration parameters of a given HBase application. The key is to build a low-cost performance model that takes the configuration parameters as input. To this end, different modeling techniques are systematically studied, and an ensemble learning algorithm is adopted to construct the performance model. A genetic algorithm then uses the performance model to search for the optimal configuration parameters for the application. As a result, the method can quickly and automatically identify a set of configuration parameter values that maximizes application performance. Experiments on five applications from the Yahoo! Cloud Serving Benchmark (YCSB) show that, compared with the default configuration, the optimized configuration increases throughput by 41% on average and by up to 97%. At the same time, the latency of HBase operations decreases by 11.3% on average and by up to 57%.

Key words: Auto tuning, HBase, Machine learning, Performance modeling, Performance optimization
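
The two-stage idea summarized in the abstract (fit a cheap performance model on sampled configurations, then let a genetic algorithm search that model) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three parameters, their ranges, the default values, and the `measure_throughput` function are hypothetical stand-ins for real HBase settings and a real YCSB run, and a bagged k-nearest-neighbour regressor is used only as one simple example of an ensemble learner.

```python
import random

# Hypothetical parameter ranges and default values, for illustration only.
BOUNDS = [(10.0, 200.0), (64.0, 512.0), (0.1, 0.6)]
DEFAULT = [30.0, 128.0, 0.4]

def measure_throughput(cfg):
    """Synthetic stand-in for a benchmark run; NOT real HBase behaviour."""
    a, b, c = cfg
    return (1000.0 - 0.02 * (a - 120.0) ** 2
            - 0.005 * (b - 300.0) ** 2 - 800.0 * (c - 0.45) ** 2)

def normalize(cfg):
    """Scale every parameter to [0, 1] so distances are comparable."""
    return [(v - lo) / (hi - lo) for v, (lo, hi) in zip(cfg, BOUNDS)]

def random_config(rng):
    return [rng.uniform(lo, hi) for lo, hi in BOUNDS]

class BaggedKNN:
    """Bagging ensemble: each member is a k-NN regressor on a bootstrap sample."""
    def __init__(self, samples, rng, n_members=10, k=3):
        self.k = k
        # Each member draws len(samples) points with replacement.
        self.members = [[rng.choice(samples) for _ in samples]
                        for _ in range(n_members)]

    def _knn(self, member, x):
        def dist(s):
            return sum((p - q) ** 2 for p, q in zip(s[0], x))
        nearest = sorted(member, key=dist)[: self.k]
        return sum(y for _, y in nearest) / len(nearest)

    def predict(self, cfg):
        x = normalize(cfg)
        # The ensemble prediction is the mean of the member predictions.
        return sum(self._knn(m, x) for m in self.members) / len(self.members)

def genetic_search(model, rng, pop_size=20, generations=25, mutation_rate=0.2):
    # Seed the population with the default so the result is never worse
    # than the default according to the model.
    population = [list(DEFAULT)] + [random_config(rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        scored = sorted(((model.predict(c), c) for c in population),
                        key=lambda t: t[0], reverse=True)
        next_pop = [scored[0][1]]  # elitism: keep the best individual
        while len(next_pop) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(scored, 3), key=lambda t: t[0])[1]
            p2 = max(rng.sample(scored, 3), key=lambda t: t[0])[1]
            # Uniform crossover, then clipped Gaussian mutation.
            child = [rng.choice(pair) for pair in zip(p1, p2)]
            for i, (lo, hi) in enumerate(BOUNDS):
                if rng.random() < mutation_rate:
                    step = rng.gauss(0.0, 0.1 * (hi - lo))
                    child[i] = min(hi, max(lo, child[i] + step))
            next_pop.append(child)
        population = next_pop
    return max(population, key=model.predict)

rng = random.Random(42)
train_cfgs = [random_config(rng) for _ in range(100)]
samples = [(normalize(c), measure_throughput(c)) for c in train_cfgs]
model = BaggedKNN(samples, rng)
best = genetic_search(model, rng)
print("best config:", [round(v, 2) for v in best])
print("predicted throughput:", round(model.predict(best), 1),
      "vs default:", round(model.predict(DEFAULT), 1))
```

In this sketch the expensive step (running the benchmark) happens only while collecting the training samples; the genetic search afterwards queries the model alone, which is what keeps the search cheap. In a real setting, the configuration returned by the search would still need to be validated with an actual benchmark run.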

CLC Number:

  • TP391