Charm++ RTS based fault tolerance mechanism of heterogeneous computing

Abstract

Fault tolerance is an unavoidable issue for long-running, large-scale applications, and heterogeneous devices make the reliability problem more pronounced. Focusing on hardware failures, this paper presents a fault-tolerance mechanism for heterogeneous clusters. The mechanism is implemented in WIGEON, a large-scale parallel code for cosmological fluid simulation, using the parallel programming model of the Charm++ RTS together with CUDA. It is combined with dynamic load balancing to adapt to the unbalanced compute-node configuration left after a failure. Experiments and analysis on Mole8.5 validate the efficiency and feasibility of the mechanism; recovery takes only 1 to 4 seconds. The paper also uses distributed redundant data to improve the Double-in-Memory checkpoint algorithm. The new fault-tolerant algorithm reduces the memory footprint by up to 50% with no additional performance loss.

Keywords: fault tolerance, heterogeneous, in-memory checkpoint, Charm++, load balancing, distributed redundancy

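Abstract (translated from the Chinese): Fault tolerance is an unavoidable problem for large-scale parallel programs that run for long periods, and the addition of heterogeneous computing components to supercomputers makes the problem more complex. This work examines application fault tolerance on heterogeneous parallel systems composed of CPUs and GPUs, and restructures the large-scale computational cosmology software WIGEON using the Charm++ parallel programming model and the CUDA parallel computing architecture. An in-memory checkpoint mechanism for application fault tolerance is designed and implemented to handle fail-stop hardware faults in heterogeneous parallel systems. After the computation is recovered, the mechanism supports adaptive load adjustment for the changed CPU/GPU resource configuration. Experiments and analysis on the Mole8.5 high-performance computer verify the efficiency and feasibility of the heterogeneous fault-tolerance scheme; fault recovery takes only 1 to 4 s. In addition, distributed redundant data is used to improve the existing in-memory checkpoint storage scheme of Charm++: compared with the original Double-in-Memory mechanism, performance is unaffected and the additional memory usage is reduced by up to 50%.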

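Keywords (translated from the Chinese): fault tolerance, heterogeneous, diskless checkpoint, Charm++, load balancing, distributed redundancy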
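The Double-in-Memory checkpoint scheme that the abstract takes as its baseline can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the paper's implementation: `Node`, `buddy_of`, `checkpoint_all`, and `recover` are hypothetical names, the ring-shaped buddy assignment is an assumption, and none of these are Charm++ APIs. Each node keeps one copy of its checkpoint in its own memory and a second copy in the memory of a buddy node, so a single fail-stop failure can be repaired without touching disk.

```python
# Sketch of double in-memory (buddy) checkpointing, simulated in one process.
# All names here are hypothetical and chosen for illustration only.
import copy


class Node:
    def __init__(self, rank, state):
        self.rank = rank
        self.state = state        # live application data
        self.local_ckpt = None    # this node's own checkpoint copy
        self.buddy_ckpt = None    # checkpoint copy held on behalf of another node


def buddy_of(rank, n):
    """Rank that stores the second copy of `rank`'s checkpoint (assumed ring layout)."""
    return (rank + 1) % n


def checkpoint_all(nodes):
    """Every node keeps one copy locally and places one copy on its buddy."""
    n = len(nodes)
    for node in nodes:
        snapshot = copy.deepcopy(node.state)
        node.local_ckpt = snapshot
        nodes[buddy_of(node.rank, n)].buddy_ckpt = copy.deepcopy(snapshot)


def recover(nodes, failed_rank):
    """Roll all nodes back to the last checkpoint after one fail-stop failure."""
    n = len(nodes)
    buddy = nodes[buddy_of(failed_rank, n)]
    # A replacement node starts empty and pulls the lost state from the buddy.
    replacement = Node(failed_rank, copy.deepcopy(buddy.buddy_ckpt))
    replacement.local_ckpt = copy.deepcopy(buddy.buddy_ckpt)
    # The replacement must again hold the backup of the node it protects.
    protected = nodes[(failed_rank - 1) % n]
    replacement.buddy_ckpt = copy.deepcopy(protected.local_ckpt)
    nodes[failed_rank] = replacement
    # Survivors roll back to their local copy so the global state is consistent.
    for node in nodes:
        if node.rank != failed_rank:
            node.state = copy.deepcopy(node.local_ckpt)
    return nodes


if __name__ == "__main__":
    nodes = [Node(r, {"step": 0, "cells": [r] * 4}) for r in range(4)]
    checkpoint_all(nodes)
    for node in nodes:
        node.state["step"] = 10   # work done after the checkpoint
    nodes[2].state = None         # simulated fail-stop failure of node 2
    recover(nodes, 2)
    print(nodes[2].state)
```

In this baseline every checkpoint exists twice across the cluster, which is the memory overhead the abstract says its distributed-redundancy variant reduces by up to 50%. The abstract does not describe how that variant encodes the redundant data, so it is not sketched here.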

MENG Chen1,2, CAO Zongyan1, WANG Long1, CHI Xuebin1. Charm++ RTS based fault tolerance mechanism of heterogeneous computing[J]. Computer Engineering and Applications, 2016, 52(13): 1-7.

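MENG Chen1,2, CAO Zongyan1, WANG Long1, CHI Xuebin1. Research on application fault tolerance for heterogeneous computing based on the Charm++ runtime environment (Chinese-language title) [J]. Computer Engineering and Applications, 2016, 52(13): 1-7.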

