Design and implementation of fault tolerant mechanism in parallel C programming language

doi:10.3778/j.issn.1002-8331.1801-0132

Abstract

Abstract: Large-scale heterogeneous many-core systems have become the main direction of supercomputer development because of their strong computing capability and high performance-to-power ratio. However, their complex heterogeneous architecture and huge scale pose great challenges to system availability, so research on lightweight fault tolerance techniques for such systems is of great significance. To address the excessive overhead of traditional checkpoint-based system-level fault tolerance, two language-supported fault tolerance mechanisms are designed and implemented in the Parallel C language, balancing ease of use and efficiency: lightweight degradation based on local fault awareness, and checkpointing based on compiler directives and automatic analysis. Lightweight degradation is implemented on top of a dynamic task scheduling framework; it supports many-core systems and scales to more than one million parallel units. For checkpointing, the programmer inserts simple compiler directives and the compiler analyzes the program to identify data that need not be saved, which effectively reduces the amount of data saved and restored. Test results on the Sunway TaihuLight supercomputer show that both mechanisms perform well compared with traditional fault tolerance methods. The fault tolerance overhead of lightweight degradation is less than 1%, and the execution time of a run with a single fault is reduced by more than 3.5% compared with the traditional rollback method. In typical applications, the checkpointing mechanism reduces the amount of saved data to as little as 1/10, which shows good practicality.

Key words: fault tolerance, degradation, checkpoint, Parallel C language
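
The checkpointing idea summarized in the abstract, saving only the data that is still needed after a restart, can be illustrated with a minimal C sketch. The abstract does not give Parallel C's directive syntax or runtime interface, so everything here is an assumption made for illustration: the `region_t` descriptor, the `skip` flag (standing in for the result of the programmer's directive plus compiler analysis), and the functions `checkpoint_save` and `checkpoint_restore` are hypothetical names, and an in-memory buffer stands in for real checkpoint storage.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for one data region of the application. */
typedef struct {
    void  *addr;  /* start of the region */
    size_t size;  /* region size in bytes */
    int    skip;  /* 1 if the region was marked as not needed after restart */
} region_t;

/* Copy every region that is not marked skip into buf.
 * Returns the number of bytes written to the checkpoint buffer. */
size_t checkpoint_save(const region_t *regions, size_t count,
                       unsigned char *buf, size_t capacity)
{
    size_t used = 0;
    for (size_t i = 0; i < count; i++) {
        if (regions[i].skip)
            continue;                      /* data not needed: do not save it */
        assert(used + regions[i].size <= capacity);
        memcpy(buf + used, regions[i].addr, regions[i].size);
        used += regions[i].size;
    }
    return used;
}

/* Copy saved data back into the regions, in the same order as the save.
 * Skipped regions are left untouched. Returns the number of bytes read. */
size_t checkpoint_restore(const region_t *regions, size_t count,
                          const unsigned char *buf)
{
    size_t used = 0;
    for (size_t i = 0; i < count; i++) {
        if (regions[i].skip)
            continue;
        memcpy(regions[i].addr, buf + used, regions[i].size);
        used += regions[i].size;
    }
    return used;
}
```

In this sketch the checkpoint size depends only on the regions that are not skipped, so an application whose memory footprint is dominated by recomputable scratch data saves and restores correspondingly less. How the actual mechanism identifies such data is described only at the level of the abstract above.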


HE Wangquan, FANG Yanfei, WEI Di, DONG Enming, QI Fengbin. Design and implementation of fault tolerant mechanism in parallel C programming language[J]. Computer Engineering and Applications, 2018, 54(17): 41-49.


