Computer Science (计算机科学) ›› 2017, Vol. 44 ›› Issue (4): 43-46. doi: 10.11896/j.issn.1002-137X.2017.04.010

• NASAC 2015 •

Time Series Based Killer Task Online Recognition Approach

TANG Hong-yan, LI Ying, JIA Tong and YUAN Xiao-yong

  1. School of Software and Microelectronics, Peking University, Beijing 100871; National Engineering Research Center for Software Engineering, Peking University, Beijing 100871
  • Online: 2018-11-13 Published: 2018-11-13
  • Supported by:
    Key Project of the Shenzhen Science and Technology Program (JSGG20140516162852628)

Abstract: By analyzing the failure counts and failure patterns of tasks in the Google cluster dataset, this paper identifies killer tasks, which are tasks that fail frequently and consecutively. Killer tasks not only degrade the reliability and availability of applications running on cloud computing systems, but also waste large amounts of resources and significantly increase scheduling overhead. Based on the resource usage patterns of killer tasks, an online recognition approach is proposed that uses resource usage time series to recognize killer tasks accurately at an early stage of failure and to notify the cloud computing system to take proactive failure-recovery actions, thereby avoiding unnecessary rescheduling and resource waste. Experimental results show that the approach recognizes killer tasks with 98.5% precision within, on average, the first 3% of the failure duration, while saving 96.75% of system resources.

Key words: Cloud computing system, Killer tasks, Online recognition, Time series, Resource usage pattern, Failure frequency

