Computer Science ›› 2021, Vol. 48 ›› Issue (11A): 225-231. doi: 10.11896/jsjkx.201200066

• Big Data & Data Science •

Analysis of Workload Failure in Co-located Data Centers

JIANG Cong-feng1, YIN Ji-liang1, HU Hai-zhou1, YAN Long-chuan2, ZHANG Ji-lin3, WAN Jian4, QIU Ye-liang5   

  1 School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China
  2 State Grid Electrical Information Communication Co., Ltd., Beijing 100053, China
  3 School of Cyberspace Security, Hangzhou Dianzi University, Hangzhou 310018, China
  4 School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
  5 Alibaba Cloud Computing Co., Ltd., Hangzhou 311121, China
  • Online: 2021-11-10 Published: 2021-11-12
  • About author: JIANG Cong-feng, born in 1980, Ph.D., professor, Ph.D. supervisor, is a member of the China Computer Federation. His main research interests include cloud computing, system optimization, and performance evaluation.
  • Supported by:
    National Key Research and Development Program of China (2017YFB101000), National Natural Science Foundation of China (61972118), and Zhejiang Key Research and Development Program of China (2019C01059).

Abstract: Workload co-location can greatly increase the resource utilization of cloud data centers, but it also increases scheduling complexity and the number of job failures. This paper investigates the cluster trace dataset from Alibaba Cloud and studies the failure rates of batch workloads together with cluster resource utilization. The main contributions and findings are as follows. First, only a small portion of specific job types accounts for the overall cluster failure rate and for the resource waste caused by job failures. Second, the execution time of task failover in the Fuxi distributed scheduler can be characterized by a Gaussian distribution, and the task scheduling delay by a Zipf distribution. Third, the distribution of failed instances across cluster nodes shows that job failures occur randomly across the cluster in space, and that failed jobs are prone to fail again after failover. Moreover, in the temporal dimension, job failures are not evenly distributed. Fourth, the mean time between failures (MTBF) of the cluster is calculated from instance failure data, and the results show that most cluster nodes have an MTBF of 1000 seconds, while a few have an MTBF of 10000 seconds.
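
The abstract does not spell out how the mean time between failures is derived from instance failure data. As a minimal sketch, assuming MTBF for a node is the average gap between consecutive failure timestamps on that node, it could be computed as follows. The function name `mtbf_per_node`, the `(node_id, timestamp)` record layout, and the sample values are all hypothetical and are not taken from the paper or the Alibaba trace.

```python
from collections import defaultdict
from statistics import mean


def mtbf_per_node(failures):
    """Return {node_id: mean gap in seconds between consecutive failures}.

    `failures` is an iterable of (node_id, timestamp_seconds) pairs
    (an assumed layout). Nodes with fewer than two failures are skipped,
    because no gap between failures can be measured for them.
    """
    by_node = defaultdict(list)
    for node_id, ts in failures:
        by_node[node_id].append(ts)

    result = {}
    for node_id, stamps in by_node.items():
        stamps.sort()
        # Gaps between each failure and the next one on the same node.
        gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        if gaps:
            result[node_id] = mean(gaps)
    return result


# Hypothetical records for illustration only.
sample = [
    ("m_1", 0), ("m_1", 1000), ("m_1", 2000),
    ("m_2", 500), ("m_2", 10500),
    ("m_3", 42),  # a single failure, so no gap and no MTBF entry
]
mtbf = mtbf_per_node(sample)
```

Under this definition, a node that fails at regular 1000-second intervals gets an MTBF of 1000 seconds. Other definitions, for example ones that subtract repair time from each gap, would give different values.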

Key words: Co-located cluster, Distributed scheduling, Failure analysis, Workload characteristics

CLC Number: 

  • TN391