Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (12): 85-91. DOI: 10.3778/j.issn.1002-8331.1606-0108

• Big Data and Cloud Computing •

Improvement of job scheduling algorithm on Hadoop

FENG Xingjie, HE Yang

  1. School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
  • Online: 2017-06-15 Published: 2017-07-04

Abstract: Load imbalance is a common problem in distributed clusters, and Hadoop does not take performance differences between nodes into account. Although Hadoop has a load balancing mechanism, its effect is not ideal, so load imbalance often arises at run time. To address this problem, this paper analyzes the Hadoop source code in depth and clarifies how Hadoop operates. Within Yarn, the resource management mechanism of Hadoop, it improves the ordering of Hadoop tasks and establishes new task ordering rules. It also proposes performance evaluation indexes for each node, divided into dynamic performance indexes and static performance indexes. On this basis, the FairScheduler algorithm of Yarn is improved to form a scheduling algorithm that considers node performance. The Hadoop source code is recompiled, and comparative experiments are carried out on the resulting Hadoop platform. The experiments show that adding node performance indexes effectively addresses the Hadoop load balancing problem and greatly improves the running efficiency of Hadoop.

Key words: big data, Hadoop, Yarn, load balancing, FairScheduler algorithm
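
The abstract does not give the concrete performance indexes or how they are combined, so the following is only an illustrative sketch of the general idea of ranking nodes by a mix of static and dynamic indicators. The chosen metrics (core count, memory size, CPU load, memory usage), the equal weighting inside each score, the `static_weight` parameter, and all names here are assumptions for illustration; they are not taken from the paper and are not part of the Yarn FairScheduler API.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical node description; all fields are illustrative assumptions."""
    name: str
    cpu_cores: int       # static: number of CPU cores
    memory_gb: float     # static: installed memory
    cpu_load: float      # dynamic: fraction of CPU in use, 0.0 to 1.0
    memory_used: float   # dynamic: fraction of memory in use, 0.0 to 1.0


def static_score(node, max_cores, max_memory_gb):
    # Hardware capacity normalized against the largest node in the cluster.
    return 0.5 * node.cpu_cores / max_cores + 0.5 * node.memory_gb / max_memory_gb


def dynamic_score(node):
    # Share of CPU and memory that is currently free.
    return 0.5 * (1.0 - node.cpu_load) + 0.5 * (1.0 - node.memory_used)


def node_score(node, max_cores, max_memory_gb, static_weight=0.4):
    # Weighted sum of the static and dynamic scores (weight is an assumption).
    return (static_weight * static_score(node, max_cores, max_memory_gb)
            + (1.0 - static_weight) * dynamic_score(node))


def rank_nodes(nodes, static_weight=0.4):
    """Return nodes ordered from most to least preferred for new tasks."""
    max_cores = max(n.cpu_cores for n in nodes)
    max_memory_gb = max(n.memory_gb for n in nodes)
    return sorted(
        nodes,
        key=lambda n: node_score(n, max_cores, max_memory_gb, static_weight),
        reverse=True,
    )
```

In this sketch a scheduler would offer the next task to the first node returned by `rank_nodes`, so that a powerful but heavily loaded node can rank below a lightly loaded one with the same hardware.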