Computer Engineering and Applications ›› 2018, Vol. 54 ›› Issue (17): 41-49. DOI: 10.3778/j.issn.1002-8331.1801-0132

Design and implementation of fault tolerant mechanism in parallel C programming language

HE Wangquan1, FANG Yanfei1, WEI Di1, DONG Enming1, QI Fengbin2   

  1. Jiangnan Institute of Computing Technology, Wuxi, Jiangsu 214083, China
  2. National Research Center of Parallel Computer Engineering & Technology, Beijing 100080, China
  • Online: 2018-09-01 Published: 2018-08-30

Parallel C语言级容错机制的设计与实现

何王全1,方燕飞1,魏  迪1,董恩铭1,漆锋滨2   

  1. 江南计算技术研究所,江苏 无锡 214083
  2. 国家并行计算机工程技术研究中心,北京 100080

Abstract: Large-scale heterogeneous many-core systems have become the development direction for supercomputers because of their strong computing capability and high performance-to-power ratio. However, their complex heterogeneous architecture and huge system scale pose great challenges to system availability, so research on lightweight fault-tolerance techniques for such systems is of great significance. To address the excessive overhead of traditional checkpoint-based system-level fault tolerance, two language-supported fault-tolerance mechanisms are designed and implemented in the Parallel C language, balancing ease of use and efficiency: lightweight degradation based on local fault sensing, and checkpointing based on compiler directives and automatic analysis. Lightweight degradation is implemented on top of a dynamic task-scheduling framework; it supports many-core systems and scales to parallelism of more than one million. For checkpointing, the programmer inserts simple compiler directives and the compiler analyzes the program to identify data that need not be saved, which effectively reduces the amount of data saved and restored. Experimental results on the Sunway TaihuLight supercomputer show that both mechanisms perform well compared with traditional fault-tolerance methods. The fault-tolerance overhead of lightweight degradation is less than 1%, and the execution time in the presence of a single fault is reduced by more than 3.5% relative to the traditional rollback method. In typical applications, the checkpoint mechanism reduces the amount of saved data to as little as 1/10, which shows good practicality.
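The abstract does not give the Parallel C scheduling interface, so the following is only a minimal single-threaded sketch of the general idea behind degradation with local fault sensing: when one worker fails, only the tasks it was running are returned to the pool for the surviving workers, and no global rollback is needed. All names here (`task_pool`, `pool_acquire`, `pool_worker_failed`, and so on) are hypothetical and are not taken from the paper.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TASKS   64
#define MAX_WORKERS 8

typedef enum { TASK_PENDING, TASK_RUNNING, TASK_DONE } task_state;

/* Hypothetical scheduler state; not the Parallel C runtime's real layout. */
typedef struct {
    int        ntasks;
    int        nworkers;
    task_state state[MAX_TASKS];
    int        owner[MAX_TASKS];    /* worker running the task, or -1 */
    bool       alive[MAX_WORKERS];  /* false once a worker has failed */
} task_pool;

void pool_init(task_pool *p, int ntasks, int nworkers)
{
    assert(ntasks >= 0 && ntasks <= MAX_TASKS);
    assert(nworkers > 0 && nworkers <= MAX_WORKERS);
    p->ntasks = ntasks;
    p->nworkers = nworkers;
    for (int t = 0; t < ntasks; t++) {
        p->state[t] = TASK_PENDING;
        p->owner[t] = -1;
    }
    for (int w = 0; w < nworkers; w++)
        p->alive[w] = true;
}

/* Hand the first pending task to a live worker; return -1 if none is available. */
int pool_acquire(task_pool *p, int worker)
{
    if (worker < 0 || worker >= p->nworkers || !p->alive[worker])
        return -1;
    for (int t = 0; t < p->ntasks; t++) {
        if (p->state[t] == TASK_PENDING) {
            p->state[t] = TASK_RUNNING;
            p->owner[t] = worker;
            return t;
        }
    }
    return -1;
}

void pool_complete(task_pool *p, int task)
{
    assert(task >= 0 && task < p->ntasks);
    assert(p->state[task] == TASK_RUNNING);
    p->state[task] = TASK_DONE;
    p->owner[task] = -1;
}

/* Degrade: mark the worker dead and requeue only the tasks it was running.
 * Returns the number of requeued tasks. Completed tasks are left untouched. */
int pool_worker_failed(task_pool *p, int worker)
{
    assert(worker >= 0 && worker < p->nworkers);
    int requeued = 0;
    p->alive[worker] = false;
    for (int t = 0; t < p->ntasks; t++) {
        if (p->state[t] == TASK_RUNNING && p->owner[t] == worker) {
            p->state[t] = TASK_PENDING;
            p->owner[t] = -1;
            requeued++;
        }
    }
    return requeued;
}

bool pool_all_done(const task_pool *p)
{
    for (int t = 0; t < p->ntasks; t++)
        if (p->state[t] != TASK_DONE)
            return false;
    return true;
}
```

In this sketch the job continues on fewer workers after a failure; a real implementation would also need failure detection and synchronization between workers, which are omitted here.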

Key words: fault tolerance, degradation, checkpoint, Parallel C language
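The checkpoint mechanism summarized in the abstract can be illustrated in the same hedged way. The abstract does not show the actual Parallel C directive syntax, so in this sketch a per-region `keep` flag stands in for the result of the compiler directives and analysis: regions flagged as not needing to be saved are skipped at both save and restore time. The types and functions (`ckpt_set`, `ckpt_register`, `ckpt_save`, `ckpt_restore`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_REGIONS 16

/* One registered memory region. `keep == false` models data that the
 * compiler analysis has marked as not needing to be saved (assumption). */
typedef struct {
    void  *addr;
    size_t size;
    bool   keep;
} region;

typedef struct {
    region regions[MAX_REGIONS];
    int    count;
} ckpt_set;

void ckpt_init(ckpt_set *s)
{
    s->count = 0;
}

void ckpt_register(ckpt_set *s, void *addr, size_t size, bool keep)
{
    assert(s->count < MAX_REGIONS);
    s->regions[s->count].addr = addr;
    s->regions[s->count].size = size;
    s->regions[s->count].keep = keep;
    s->count++;
}

/* Copy only the kept regions into buf; return the number of bytes saved. */
size_t ckpt_save(const ckpt_set *s, unsigned char *buf, size_t cap)
{
    size_t off = 0;
    for (int i = 0; i < s->count; i++) {
        const region *r = &s->regions[i];
        if (!r->keep)
            continue;
        assert(off + r->size <= cap);
        memcpy(buf + off, r->addr, r->size);
        off += r->size;
    }
    return off;
}

/* Copy the kept regions back from buf in registration order;
 * return the number of bytes restored. Skipped regions are not modified. */
size_t ckpt_restore(const ckpt_set *s, const unsigned char *buf)
{
    size_t off = 0;
    for (int i = 0; i < s->count; i++) {
        const region *r = &s->regions[i];
        if (!r->keep)
            continue;
        memcpy(r->addr, buf + off, r->size);
        off += r->size;
    }
    return off;
}
```

For example, registering a small state array with `keep` set to true and a much larger scratch array with `keep` set to false makes `ckpt_save` write only the state array, so the checkpoint size shrinks in proportion to the data excluded.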

摘要: 大规模异构众核计算机系统具有计算能力强、性能功耗比高等突出优点,已成为超级计算机的发展方向,但其复杂的异构结构和庞大的系统规模,也使系统的可用性面临巨大挑战,因此研究面向大规模异构众核系统的轻量级容错技术具有重要意义。针对传统基于检查点的系统级容错开销过大的问题,在Parallel C语言中设计并实现了故障局部感知的轻量级降级、编译指导与自动分析的检查点等语言支持的容错机制,兼顾了好用性和高效性。局部故障感知的轻量级降级结合动态任务调度框架实现,支持众核系统,可扩展到百万以上并行规模;编译指导与自动分析的检查点通过程序员插入简单的编译指示,由编译器进行分析,提示不需要保留的数据,可有效降低保留恢复的数据量。神威太湖之光超级计算机上的测试数据表明,两种容错措施相对于传统容错方法效果良好,轻量级降级的容错开销小于1%,相对于传统回卷容错方法单次故障执行时间可减少3.5%以上,编译指导与自动分析的检查点在典型应用中最多可将保留量降低至1/10,具有很好的实用性。

关键词: 容错, 降级, 检查点, Parallel C语言