Computer Science ›› 2017, Vol. 44 ›› Issue (Z11): 510-515. doi: 10.11896/j.issn.1002-137X.2017.11A.108

Data Block Density Scheduling Strategy Based on HDFS in Shared Cluster

DU Hong-guang, LEI Zhou and CHEN Sheng-bo   

  • Online: 2018-12-01 Published: 2018-12-01

Abstract: With the development of cloud computing and massive data processing technologies, shared clusters use HDFS as their distributed file system and manage computing resources through virtualization, providing runtime resources to computing frameworks and applications. Data locality is a key factor in the performance of massive data processing applications. Current research on schedulers for shared cluster management frameworks focuses mainly on raising system throughput and resource utilization by increasing scheduling parallelism, and it leaves gaps in scheduling quality, data locality among them. This paper proposes a scheduling strategy based on data block density to improve the data locality of applications. The strategy improves application performance by reducing cross-host I/O while the application runs. Experiments show that the proposed strategy effectively reduces the running time of data-intensive jobs. In WordCount and TeraSort test cases with 2.5 GB of data, the method achieved 90% data locality and shortened running time by 20%.

Key words: HDFS, Data block density, Shared cluster, Scheduling strategy
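
The abstract does not define data block density, so the following Python sketch is only one illustrative reading and not the authors' algorithm. It assumes that a host's density is the fraction of a job's input blocks with a replica on that host, and that the scheduler places the job on the densest host that still has free capacity. The names `block_density`, `pick_host`, `block_locations`, and `free_slots` are hypothetical.

```python
from collections import Counter


def block_density(block_locations, hosts):
    """Fraction of the job's input blocks that have a replica on each host.

    block_locations maps a block id to the list of hosts holding a replica
    (an assumed input; a real system would query the HDFS NameNode for it).
    """
    total = len(block_locations)
    counts = Counter()
    for replicas in block_locations.values():
        # Count each host at most once per block, even if listed twice.
        for host in set(replicas):
            counts[host] += 1
    return {h: (counts[h] / total if total else 0.0) for h in hosts}


def pick_host(block_locations, free_slots):
    """Pick the host with free capacity and the highest block density.

    Ties are broken by host name so that the choice is deterministic.
    Returns None when no host has a free slot.
    """
    candidates = sorted(h for h, slots in free_slots.items() if slots > 0)
    if not candidates:
        return None
    density = block_density(block_locations, candidates)
    return max(candidates, key=lambda h: density[h])


# Hypothetical layout: three blocks replicated across three hosts.
layout = {"b1": ["h1", "h2"], "b2": ["h2", "h3"], "b3": ["h2"]}
chosen = pick_host(layout, {"h1": 1, "h2": 1, "h3": 0})
```

Under this assumption, placing the job where most of its blocks already reside means fewer blocks are read across hosts, which is the effect the abstract attributes to the strategy.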

