Computer Engineering and Applications ›› 2013, Vol. 49 ›› Issue (9): 33-36.


Fault tolerance in real-time and multitask parallel computing system

XU Xiaodong, ZHAO Jianting, XU Chunlei   

  1. Jiangsu Automation Research Institute, Lianyungang, Jiangsu 222006, China
  • Online: 2013-05-01 Published: 2016-03-28

实时多任务并行计算系统的容错技术

徐晓东,赵建亭,许春雷   

  1. 江苏自动化研究所,江苏 连云港 222006

Abstract: Fault tolerance is a key difficulty that must be solved in the design of real-time multitask parallel computing systems. To meet the requirements of high reliability and high efficiency in such systems, the basic concepts, methods and ideas of computer system reliability and fault tolerance are introduced. Building on checkpointing and rollback recovery, a multi-level, multi-aspect design for fault-tolerant parallel computing systems is proposed, together with schemes for handling in-transit messages and orphan messages. The corresponding model and technical evaluation are presented, and simulation experiments demonstrate the validity of the model.

Key words: real-time multitask, fault tolerance, checkpoint, multi-level

摘要: 容错技术是实时多任务并行计算系统设计中必须解决的一个关键难点。针对实时多任务并行计算系统的高可靠性和高效性的要求,介绍了计算机系统可靠性和容错技术的基本概念、基本方法和基本思想,在检查点技术和卷回技术的基础上,提出了进行多层次、多角度的并行容错计算机系统设计和解决中途消息和孤立消息的相关方案,给出了相应的模型和技术评估,通过仿真实验证明了该模型的有效性。

关键词: 实时多任务, 容错, 检查点, 多层次