Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (23): 53-60. DOI: 10.3778/j.issn.1002-8331.2001-0238

Research on Fault Tolerant Clustering Algorithm of Scientific Workflow Considering Load Balancing

GAO Weijun, ZHANG Chunxia, YANG Jie, SHI Yang   

  1. School of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China
  • Online: 2020-12-01 Published: 2020-11-30

考虑负载平衡的科学工作流容错聚类算法研究

高玮军,张春霞,杨杰,师阳   

  1. 兰州理工大学 计算机与通信学院,兰州 730050

Abstract:

During scientific workflow execution, a clustered job composed of multiple tasks carries a higher risk of failure than a single task. Fault-tolerant clustering algorithms also face load imbalance during fault recovery. To address this, a Balanced Re-clustering (BR) algorithm is proposed. The algorithm improves Selective Re-clustering (SR) by combining it with Horizontal Runtime Balancing (HRB): after a failure, the failed tasks are re-run, and the longest-running task is assigned to the cluster with the shortest runtime. Experimental results show that, compared with two existing task re-clustering methods, the BR algorithm achieves performance gains of up to 84% and 18.75%, respectively, significantly reducing workflow execution cost and improving the system's operating efficiency.

Keywords: task clustering, scientific workflow, system overhead, fault-tolerant algorithm, balanced clustering
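
The balancing step described in the abstract (give the longest-running task to the cluster with the shortest runtime) can be illustrated with a small greedy sketch. This is not the authors' BR implementation, which the abstract does not specify. It assumes that only the failed tasks are re-clustered, that each task's runtime is known in advance, and that the number of clusters is fixed. The function name `balanced_recluster` and the task names are hypothetical.

```python
import heapq


def balanced_recluster(failed_tasks, num_clusters):
    """Greedy longest-task-first packing (illustrative assumption, not the paper's code).

    failed_tasks: dict mapping a task name to its estimated runtime.
    num_clusters: number of clustered jobs to rebuild from the failed tasks.
    Returns a list of clusters, each a list of task names.
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")

    # Min-heap of (total runtime, cluster index): the root is always the
    # cluster with the shortest accumulated runtime.
    heap = [(0.0, i) for i in range(num_clusters)]
    heapq.heapify(heap)
    clusters = [[] for _ in range(num_clusters)]

    # Visit tasks from longest to shortest runtime.
    ordered = sorted(failed_tasks.items(), key=lambda kv: kv[1], reverse=True)
    for name, runtime in ordered:
        total, idx = heapq.heappop(heap)
        clusters[idx].append(name)
        heapq.heappush(heap, (total + runtime, idx))
    return clusters


if __name__ == "__main__":
    # Hypothetical runtimes (seconds) of tasks from a failed clustered job.
    failed = {"t1": 10, "t2": 7, "t3": 5, "t4": 3, "t5": 2}
    print(balanced_recluster(failed, 2))
```

With the hypothetical runtimes above and two clusters, the tasks are split into groups whose total runtimes are 13 and 14, rather than placing all long tasks in one cluster.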

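Abstract (Chinese version, translated):

During scientific workflow execution, a clustered job composed of multiple tasks carries a higher failure risk than a single task. While performing fault recovery, fault-tolerant clustering algorithms also face load imbalance; a Balanced Re-clustering (BR) algorithm is proposed to address this. The algorithm improves the Selective Re-clustering (SR) method by combining it with the Horizontal Runtime Balancing (HRB) clustering algorithm: it assigns the task with the longest runtime to the cluster with the shortest runtime and re-runs the failed tasks after a fault occurs. Experimental results show that, compared with two existing task re-clustering methods, the BR algorithm achieves performance gains of up to 84% and 18.75%, respectively, significantly reducing workflow execution cost and improving the system's operating efficiency.

Keywords (Chinese version, translated): task clustering, scientific workflow, system overhead, fault-tolerant algorithm, balanced clustering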