Computer Science ›› 2021, Vol. 48 ›› Issue (11A): 225-231. doi: 10.11896/jsjkx.201200066

• Big Data & Data Science •

Analysis of Workload Failure in Co-located Data Centers

JIANG Cong-feng1, YIN Ji-liang1, HU Hai-zhou1, YAN Long-chuan2, ZHANG Ji-lin3, WAN Jian4, QIU Ye-liang5

  1 School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China
  2 State Grid Electrical Information Communication Co., Ltd., Beijing 100053, China
  3 School of Cyberspace Security, Hangzhou Dianzi University, Hangzhou 310018, China
  4 School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
  5 Alibaba Cloud Computing Co., Ltd., Hangzhou 311121, China
  • Online: 2021-11-10 Published: 2021-11-12
  • Corresponding author: JIANG Cong-feng (cjiang@hdu.edu.cn)
  • About author: JIANG Cong-feng, born in 1980, Ph.D., professor, Ph.D. supervisor, is a member of China Computer Federation. His main research interests include cloud computing, system optimization and performance evaluation.
  • Supported by:
    National Key Research and Development Program of China (2017YFB101000), National Natural Science Foundation of China (61972118) and Zhejiang Key Research and Development Program of China (2019C01059).

Abstract: Co-locating workloads in a data center significantly improves the resource utilization of cloud data centers, but it also increases scheduling complexity and the job failure rate. Using the data center log dataset cluster-trace-v2018 released by Alibaba Cloud as a case study, this paper analyzes in detail, from the perspective of offline batch workloads, how different types of workloads differ in success rate and resource utilization. The main findings are as follows. 1) Failures of a small number of job types affect the overall job success rate of the cluster and waste cluster resources. 2) In the Fuxi distributed scheduling system, the execution time of task failover follows a Gaussian distribution, and the task scheduling delay follows a Zipf distribution. 3) The distribution of failed instances across cluster nodes shows that job failures are spatially random and that failed instances are prone to fail again, whereas the overall cluster failure rate is unevenly distributed over time. 4) Taking task instance failures as the baseline, the mean time between failures (MTBF) of cluster nodes is calculated: most nodes have an MTBF of about 1,000 s, while a small number of nodes with low instance failure rates have an MTBF of more than 10,000 s.

Key words: Co-located cluster, Workload characteristics, Distributed scheduling, Failure analysis

CLC Number:

  • TN391
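
Finding 4 of the abstract derives a per-node mean time between failures (MTBF) from task instance failures. The abstract does not give the exact formula, so the following Python sketch assumes the simplest one: the length of the observation window divided by the number of failed instances on a node. The function name `node_mtbf`, the `(machine_id, status)` record layout, and the `"Failed"` status label are illustrative assumptions, and the sample records are toy data rather than values from the trace.

```python
from collections import defaultdict


def node_mtbf(records, window_seconds):
    """Return {machine_id: MTBF in seconds} for nodes with at least one failure.

    records: iterable of (machine_id, status) pairs, one per task instance.
    window_seconds: length of the observation window in seconds.
    Assumption: MTBF = observation window / number of failed instances.
    """
    failures = defaultdict(int)
    for machine_id, status in records:
        if status == "Failed":  # assumed label for a failed instance
            failures[machine_id] += 1
    # Nodes without failures are omitted: their MTBF is undefined here.
    return {m: window_seconds / n for m, n in failures.items()}


# Toy records (hypothetical, not taken from cluster-trace-v2018).
records = [
    ("m_1", "Failed"), ("m_1", "Terminated"), ("m_1", "Failed"),
    ("m_2", "Failed"), ("m_3", "Terminated"),
]
mtbf = node_mtbf(records, window_seconds=20_000)
print(mtbf)
```

Under this definition a node with few failed instances gets a large MTBF, which is consistent with the abstract's observation that nodes with low instance failure rates reach the highest values.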