Computer Science ›› 2017, Vol. 44 ›› Issue (4): 43-46. doi: 10.11896/j.issn.1002-137X.2017.04.010
唐红艳,李影,贾统,袁小雍
TANG Hong-yan, LI Ying, JIA Tong and YUAN Xiao-yong
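Abstract: By analyzing the failure counts and failure patterns of tasks in a Google cluster, we identify killer tasks, which are characterized by high failure frequency and consecutive failures. Killer tasks not only undermine the reliability and availability of applications running on cloud computing systems, but also waste large amounts of resources and significantly increase scheduling load. Based on the resource usage patterns of killer tasks, we propose a time-series-based online identification method that uses resource usage time series to accurately identify killer tasks early in their failure and to notify the cloud computing system so that it can take proactive failure recovery measures, thereby avoiding unnecessary rescheduling and resource waste. Experimental results show that the method identifies killer tasks with 98.5% accuracy within, on average, 3% of the failure time, while saving 96.75% of system resources.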
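The first step described in the abstract, flagging tasks by failure count and consecutive failures, might be sketched roughly as follows. This is only an illustration and not the authors' method: the function name `find_killer_tasks`, the event labels `"FAIL"` and `"FINISH"`, the input layout, and both thresholds are hypothetical, since the abstract does not state the actual cut-offs or the trace fields used.

```python
from collections import defaultdict

# Hypothetical thresholds: the abstract gives no concrete cut-offs.
MIN_FAILURES = 10      # assumed minimum total failures for a "killer" task
MIN_CONSECUTIVE = 5    # assumed minimum run of back-to-back failures


def find_killer_tasks(outcomes, min_failures=MIN_FAILURES,
                      min_consecutive=MIN_CONSECUTIVE):
    """Return the ids of tasks that fail often and repeatedly in a row.

    `outcomes` is an iterable of (task_id, outcome) pairs in time order,
    one pair per execution attempt, where outcome is "FAIL" or any other
    label (assumed layout, not the real trace schema).
    """
    total = defaultdict(int)      # total failures per task
    run = defaultdict(int)        # current streak of consecutive failures
    longest = defaultdict(int)    # longest streak seen so far

    for task_id, outcome in outcomes:
        if outcome == "FAIL":
            total[task_id] += 1
            run[task_id] += 1
            longest[task_id] = max(longest[task_id], run[task_id])
        else:
            run[task_id] = 0      # any non-failure breaks the streak

    return {t for t in total
            if total[t] >= min_failures and longest[t] >= min_consecutive}
```

In this sketch a task that fails many times but always succeeds in between is not flagged, because its longest streak stays at one. The online part of the approach, which works from resource usage time series, is not covered here because the abstract does not describe it in enough detail.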