Computer Engineering and Applications ›› 2011, Vol. 47 ›› Issue (21): 17-22.

• Doctoral Forum •

DoubleRun: using temporal redundancy to ensure the reliability of processors

LIU Guanghui   

  1. School of Computer, National University of Defense Technology, Changsha 410073, China
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2011-07-21 Published: 2011-07-21

Abstract: This paper presents BRO-SOC (Backward Recovery Oriented Sphere Of Correctness), a framework based on the SOR model, and, guided by this framework, proposes DoubleRun, a fault-tolerant processor scheme that uses deterministic replay to ensure processor reliability. Because DoubleRun sets the fault detection boundary at the SOC2 level of the BRO-SOC framework, transient faults in the processor pipeline can be tolerated as long as the L1 cache is suitably extended. DoubleRun provides full fault coverage without modifying the processor pipeline, so it degrades pipeline performance less than other schemes do. A subset of the SPEC2000 benchmarks is used to evaluate the fault-free performance of DoubleRun, and a metric called fault-tolerant Time and Area Cost (TAC) is proposed to compare DoubleRun with other schemes (DCC and Slipstream). The experimental results indicate that DoubleRun achieves full transient fault coverage at a cost of only 6.9% additional area and 89.8% more time. The TAC of DoubleRun is 7% larger than that of Slipstream, but DoubleRun provides full fault coverage; compared with DCC, which offers the same fault coverage, the TAC of DoubleRun is 14% smaller.

Key words: transient fault, soft error, deterministic replay, processor reliability, temporal redundancy, Backward Recovery Oriented Sphere Of Correctness(BRO-SOC)
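
The general idea of temporal redundancy with backward recovery can be illustrated in software. The sketch below is not the DoubleRun hardware mechanism, which the abstract describes only at a high level; it is a minimal analogy, assuming a deterministic `step` function and a state that can be checkpointed by copying. The names `run_with_temporal_redundancy` and `make_faulty_step` are hypothetical and introduced only for illustration.

```python
import copy


def run_with_temporal_redundancy(step, state, max_retries=3):
    """Execute `step` twice from the same checkpoint and commit only if
    both executions agree; otherwise roll back and try again.

    Illustrative analogy only, not the DoubleRun design.
    """
    checkpoint = copy.deepcopy(state)  # state to roll back to
    for _ in range(max_retries + 1):
        first = step(copy.deepcopy(checkpoint))   # original execution
        second = step(copy.deepcopy(checkpoint))  # replayed execution
        if first == second:
            return first  # results match: commit
        # Mismatch: discard both results and re-execute from the checkpoint.
    raise RuntimeError("executions kept disagreeing; fault may not be transient")


def make_faulty_step(fault_on_call):
    """Build a deterministic step that flips one bit of its result on the
    given call number (1-based) to simulate a transient fault.
    Passing 0 yields a fault-free step."""
    calls = {"n": 0}

    def step(state):
        calls["n"] += 1
        result = state["x"] * 2 + 1
        if calls["n"] == fault_on_call:
            result ^= 1 << 3  # simulated single-bit upset
        return {"x": result}

    return step


if __name__ == "__main__":
    # The fault hits the second execution; the retry produces a matching pair.
    print(run_with_temporal_redundancy(make_faulty_step(2), {"x": 5}))
```

Because every unit of work is executed at least twice, a scheme of this kind roughly doubles run time in the fault-free case, which is in line with the 89.8% time overhead reported in the abstract.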
