Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240800114-9. doi: 10.11896/jsjkx.240800114

• Big Data & Data Science •

Resource Preference-sensitive Cloud Configuration Recommendation Method for Big Data Applications

LIANG Zheheng1,2, WU Yuewen4,5, LI Yongjian3, ZHANG Xiaolu1,2, SHEN Guiquan1,2, SU Lingang6, LIU Junle3

    1 Information Center of Guangdong Power Grid Limited Liability Company, Guangzhou 510000, China
    2 Joint Laboratory on Cyberspace Security, China Southern Power Grid, Guangzhou 510000, China
    3 Zhongshan Power Supply Bureau, Zhongshan, Guangdong 528400, China
    4 Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
    5 Key Laboratory of System Software (Chinese Academy of Sciences), Beijing 100190, China
    6 Baidu Online Network Technology (Beijing) Co., Ltd., Beijing 100085, China
  • Online: 2025-06-16 Published: 2025-06-12
  • Corresponding author: WU Yuewen (wuyuewen@otcaix.iscas.ac.cn)
  • About author: LIANG Zheheng (liangzheheng@qq.com), born in 1986, M.S., is a member of CCF (No. T0223M). His main research interests include digital evaluation technology, the Internet of Things, and artificial intelligence.
    WU Yuewen, born in 1988, Ph.D., senior engineer, is a member of CCF (No. J6673M). His main research interests include performance optimization of cloud-edge systems.
  • Supported by:
    Guangdong Power Grid Limited Liability Company research project on streaming data processing technology for large-scale smart grid equipment (037800KC23090006) and National Natural Science Foundation of China (62302489).

Abstract: Big data and stream data computing are widely used to support scenarios such as anomaly monitoring and early warning in smart grids. Cloud computing is the mainstream runtime environment for big data and stream data applications, but selecting suitable cloud resources to optimize their performance is challenging. Current methods based on exhaustive configuration search take all candidate cloud configurations as the search space, which makes the search space excessively large and leaves the search prone to getting stuck in local optima. To address this issue, this paper proposes a resource preference-sensitive cloud configuration recommendation method for big data applications. It employs a resource preference-sensitive random forest model as the probabilistic model in Bayesian optimization to balance search accuracy against search cost when the configuration option space is large. Experimental results show that, compared with the exhaustive configuration search method CherryPick, the proposed method improves search accuracy by 23% while reducing the number of searches by 25%~44%. Compared with the data-driven method RP-CH, the accuracy of its search results is 10% lower, but the average number of searches is reduced by 78%.

Key words: Big data application, Cloud configuration recommendation, Resource preference, Principal component analysis (PCA), Bayesian optimization
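The core loop named in the abstract, Bayesian optimization with a random forest as the probabilistic model, can be illustrated with a minimal sketch. It is not the authors' implementation: the candidate list `CONFIGS`, the cost function `toy_cost`, the tree depth, the forest size, and the lower-confidence-bound acquisition rule are all assumptions chosen for illustration, and the resource-preference and PCA steps are omitted because the abstract does not specify them.

```python
import random
import statistics

# Hypothetical candidate cloud configurations: (vCPUs, memory in GB).
CONFIGS = [(v, m) for v in (2, 4, 8, 16, 32) for m in (4, 8, 16, 32, 64)]

def toy_cost(config):
    """Hypothetical stand-in for running the job once and measuring its cost."""
    vcpus, mem_gb = config
    runtime = 120.0 / vcpus + 256.0 / mem_gb + 5.0
    price = 0.04 * vcpus + 0.005 * mem_gb
    return runtime * price

def sse(rows):
    """Sum of squared errors of the targets around their mean."""
    targets = [y for _, y in rows]
    mean = statistics.fmean(targets)
    return sum((y - mean) ** 2 for y in targets)

def fit_tree(rows, depth, rng):
    """Fit a small regression tree, splitting on one randomly chosen feature per node."""
    targets = [y for _, y in rows]
    feature = rng.randrange(len(rows[0][0]))
    values = sorted({x[feature] for x, _ in rows})
    if depth == 0 or len(values) < 2:
        return statistics.fmean(targets)  # leaf: mean observed cost
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for r in rows if r[0][feature] <= threshold]
        right = [r for r in rows if r[0][feature] > threshold]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, threshold, left, right)
    _, threshold, left, right = best
    return (feature, threshold,
            fit_tree(left, depth - 1, rng), fit_tree(right, depth - 1, rng))

def predict_tree(tree, x):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if x[feature] <= threshold else right
    return tree

def fit_forest(rows, n_trees, rng):
    """Train each tree on a bootstrap resample of the observations."""
    return [fit_tree([rng.choice(rows) for _ in rows], 3, rng)
            for _ in range(n_trees)]

def lower_confidence_bound(forest, x, kappa=1.0):
    """Acquisition score: predicted mean cost minus kappa times tree disagreement."""
    preds = [predict_tree(tree, x) for tree in forest]
    return statistics.fmean(preds) - kappa * statistics.pstdev(preds)

def recommend(configs, run_job, n_init=3, budget=8, seed=0):
    """Search for a low-cost configuration using at most `budget` job runs."""
    rng = random.Random(seed)
    observed = {c: run_job(c) for c in rng.sample(configs, n_init)}
    while len(observed) < budget:
        pending = [c for c in configs if c not in observed]
        if not pending:
            break
        forest = fit_forest(list(observed.items()), 20, rng)
        nxt = min(pending, key=lambda c: lower_confidence_bound(forest, c))
        observed[nxt] = run_job(nxt)
    best = min(observed, key=observed.get)
    return best, observed

best, observed = recommend(CONFIGS, toy_cost)
print(best, len(observed))
```

The spread of predictions across trees serves as the uncertainty estimate, which is what lets a random forest stand in for a Gaussian process in this kind of loop. The number of job runs is capped by `budget`, which corresponds to the search count the abstract reports reducing.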

CLC Number: TP311