Data Block Density Scheduling Strategy Based on HDFS in Shared Cluster

doi:10.11896/j.issn.1002-137X.2017.11A.108

Computer Science, 2017, Vol. 44, Issue (Z11): 510-515.

DU Hong-guang, LEI Zhou, CHEN Sheng-bo

Computer Science and Technology, Shanghai University, Shanghai 200444 (all three authors)

Online: 2018-12-01 Published: 2018-12-01

Abstract

Abstract: With the development of cloud computing and massive data processing technologies, shared clusters increasingly adopt HDFS as their distributed file system and manage computing resources through virtualization, providing runtime resources for computing frameworks and applications. As a result, computing resources and data storage become separated while an application runs. Data locality is one of the key factors affecting the performance of massive data processing applications. Current research on schedulers for shared cluster management frameworks focuses mainly on raising scheduling parallelism to improve system throughput and resource utilization, but scheduling quality still has shortcomings, such as poor data locality for applications. This paper proposes a scheduling strategy based on data block density to improve data locality: computing resources are allocated to an application in proportion to the density of its data blocks, which reduces cross-host I/O during execution and thereby improves application performance. Experiments show that the strategy effectively reduces the running time of data-intensive jobs. In tests with WordCount and TeraSort on 2.5G of data, the strategy achieved 90% data locality and shortened running time by about 20%.

Key words: HDFS, Data block density, Shared cluster, Scheduling strategy
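The proportional allocation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact definition of data block density, so this example assumes that a host's density is its share of the application's HDFS block replicas. The function names `block_density` and `allocate_by_density`, the host names, and the largest-remainder rounding are all hypothetical choices for illustration, not the authors' implementation.

```python
from collections import Counter


def block_density(block_locations):
    """Return {host: share of the application's block replicas on that host}.

    block_locations is a list with one entry per HDFS block; each entry
    lists the hosts holding a replica of that block (assumed input format).
    """
    counts = Counter(host for replicas in block_locations for host in replicas)
    total = sum(counts.values())
    return {host: n / total for host, n in counts.items()}


def allocate_by_density(density, total_slots):
    """Split total_slots across hosts in proportion to their density.

    Uses largest-remainder rounding so the result always sums to total_slots.
    """
    quotas = {h: d * total_slots for h, d in density.items()}
    alloc = {h: int(q) for h, q in quotas.items()}
    leftover = total_slots - sum(alloc.values())
    # Give the remaining slots to hosts with the largest fractional part;
    # the host name breaks ties so the result is deterministic.
    ranked = sorted(quotas, key=lambda h: (quotas[h] - alloc[h], h), reverse=True)
    for h in ranked[:leftover]:
        alloc[h] += 1
    return alloc


if __name__ == "__main__":
    # Four blocks, two replicas each, spread over three hypothetical hosts.
    locations = [["h1", "h2"], ["h1", "h3"], ["h1", "h2"], ["h2", "h3"]]
    density = block_density(locations)
    print(allocate_by_density(density, 8))  # {'h1': 3, 'h2': 3, 'h3': 2}
```

Under this assumption, hosts that store more of an application's blocks receive more of its compute slots, which is the mechanism the abstract credits with reducing cross-host I/O. A real scheduler would also need to respect each host's free capacity, which this sketch omits.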

DU Hong-guang, LEI Zhou, CHEN Sheng-bo. Data Block Density Scheduling Strategy Based on HDFS in Shared Cluster[J]. Computer Science, 2017, 44(Z11): 510-515. https://doi.org/10.11896/j.issn.1002-137X.2017.11A.108

