Charm++ RTS based fault tolerance mechanism of heterogeneous computing

Computer Engineering and Applications, 2016, Vol. 52, Issue 13: 1-7.

MENG Chen1,2, CAO Zongyan1, WANG Long1, CHI Xuebin1

1. Supercomputing Center, Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
2. University of Chinese Academy of Sciences, Beijing 100049, China

Online: 2016-07-01 Published: 2016-07-15

Abstract

Abstract: Fault tolerance is unavoidable for large-scale parallel programs that run for a long time, and the addition of heterogeneous computing components to supercomputers makes the problem more complex. This paper examines application-level fault tolerance on heterogeneous parallel systems composed of CPUs and GPUs. The large-scale cosmological fluid simulation software WIGEON is restructured using the Charm++ parallel programming model and the CUDA parallel computing architecture. To handle fail-stop hardware faults in such systems, an in-memory checkpoint mechanism for application fault tolerance is designed and implemented. After the computation recovers, the mechanism supports adaptive load adjustment for the changed CPU/GPU resource configuration. Experiments and analysis on the high-performance computer Mole8.5 confirm the efficiency and feasibility of the heterogeneous fault-tolerance scheme: recovery from a failure takes only 1~4 s. In addition, distributed redundant data is used to improve the existing in-memory checkpoint storage mode of Charm++. Compared with the original Double-in-Memory mechanism, performance is unaffected and the extra memory usage is reduced by up to 50%.

Key words: fault tolerance, heterogeneous, in-memory checkpoint, Charm++, load balancing, distributed redundancy
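
The abstract does not say how the distributed redundant data is constructed. As a rough illustration of why redundancy shared across a group can need less memory than keeping two full checkpoint copies, the sketch below uses XOR parity. This is an assumed stand-in technique, not the scheme confirmed by the paper. All function names, the group size, and the placement of the parity buffer on a node outside the group are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(checkpoints):
    """XOR all checkpoints in a group into one parity buffer."""
    return reduce(xor_bytes, checkpoints)


def recover(survivors, parity):
    """Rebuild the single lost checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


def extra_memory(state_size, group_size):
    """Total checkpoint bytes per group: (double in-memory, parity sketch)."""
    # Double in-memory: every node keeps a local copy plus a buddy's copy.
    double = 2 * state_size * group_size
    # Parity sketch (assumption): one local copy per node plus one parity buffer.
    parity = state_size * group_size + state_size
    return double, parity


# Hypothetical group of four nodes, each holding an 8-byte checkpoint.
group = [bytes([i] * 8) for i in (1, 2, 3, 4)]
parity = make_parity(group)

# Simulate a fail-stop fault on node 2 and rebuild its checkpoint.
lost = group[2]
restored = recover(group[:2] + group[3:], parity)
assert restored == lost

double, par = extra_memory(state_size=8, group_size=4)
print(double, par)  # 64 40
```

Under these assumptions the parity variant approaches half the memory of the two-copy variant as the group grows, which is in line with the "up to 50%" figure quoted above. It tolerates only one failure per group.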

MENG Chen, CAO Zongyan, WANG Long, CHI Xuebin. Charm++ RTS based fault tolerance mechanism of heterogeneous computing[J]. Computer Engineering and Applications, 2016, 52(13): 1-7.
