Computer Science, 2017, Vol. 44, Issue (Z11): 510-515. doi: 10.11896/j.issn.1002-137X.2017.11A.108


Data Block Density Scheduling Strategy Based on HDFS in Shared Cluster

DU Hong-guang, LEI Zhou and CHEN Sheng-bo   

  • Online: 2018-12-01 Published: 2018-12-01

Abstract: With the development of cloud computing and massive data processing technologies, shared clusters use HDFS as their distributed file system and manage computing resources through virtualization to provide resources for computing frameworks and applications. Data locality is a key factor in the performance of massive data processing applications. Current research on schedulers for shared cluster management frameworks focuses mainly on improving system throughput and resource utilization by increasing scheduling parallelism, and it leaves shortcomings in scheduling quality, such as data locality. This paper proposes a scheduling strategy based on data block density to improve the data locality of applications. The strategy improves application performance by reducing cross-host I/O while the application runs. Experiments show that the proposed strategy effectively reduces the running time of data-intensive jobs. In WordCount and TeraSort test cases with 2.5 GB of data, the method achieved 90% data locality and shortened running time by 20%.

Key words: HDFS, Data block density, Shared cluster, Scheduling strategy
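
The abstract does not define "data block density" or give the scheduling algorithm, so the following is only an illustrative sketch under a stated assumption: density is taken to be the fraction of an application's HDFS input blocks that have a replica on a given host, and the scheduler places the application on the free host with the highest density. The function names (`block_density`, `pick_host`), the host names, and the free-slot model are hypothetical and are not taken from the paper.

```python
from collections import Counter


def block_density(block_locations, hosts):
    """Assumed definition: share of the application's blocks with a replica on each host."""
    total = len(block_locations)
    counts = Counter()
    for replicas in block_locations.values():
        # Count each host at most once per block, even if listed twice.
        for host in set(replicas):
            counts[host] += 1
    return {h: (counts[h] / total if total else 0.0) for h in hosts}


def pick_host(block_locations, free_slots):
    """Pick the host with spare capacity that holds the largest share of blocks."""
    candidates = [h for h, slots in free_slots.items() if slots > 0]
    if not candidates:
        return None
    density = block_density(block_locations, candidates)
    # Prefer higher density; break ties by the number of free slots.
    return max(candidates, key=lambda h: (density[h], free_slots[h]))


# Hypothetical block-to-replica map for one application's input file.
blocks = {
    "blk_1": ["node-a", "node-b"],
    "blk_2": ["node-a", "node-c"],
    "blk_3": ["node-a", "node-b"],
    "blk_4": ["node-a", "node-c"],
}

print(block_density(blocks, ["node-a", "node-b", "node-c"]))
print(pick_host(blocks, {"node-a": 1, "node-b": 1, "node-c": 2}))
```

Placing the application where most of its blocks already reside is one simple way to cut cross-host I/O; the strategy in the paper may weigh density differently or combine it with other resource constraints.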

