Two-Phase TMR conserves energy by partitioning redundancy operations into two stages and making execution of the third task copy optional, yet it remains susceptible to permanent faults. Reactive-TMR (R-TMR) counters this by isolating faulty cores, so it handles both transient and permanent faults. However, the lightweight hardware that R-TMR requires not only adds complexity but also becomes a single point of failure. To overcome the constraints of isolated nodes, this paper proposes a Fault Tolerance and Isolation TMR (FTI-TMR) algorithm for interconnected multicore systems. The algorithm constructs a stability metric to identify the most reliable nodes in the system, and these nodes then perform periodic diagnostics to isolate permanent faults. Experimental results show that FTI-TMR reduces task workload by approximately 30% compared with baseline TMR while achieving higher permanent fault coverage.
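The two-stage idea behind Two-Phase TMR can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `two_phase_tmr`, the callable `task`, and the choice to raise an error when no majority exists are all assumptions made for illustration. The sketch runs two copies first and executes the third copy only when they disagree.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def two_phase_tmr(task: Callable[[], T]) -> Tuple[T, int]:
    """Return (voted result, number of copies executed).

    Hypothetical helper: `task` stands in for one redundant task copy.
    """
    # Phase 1: execute two copies and compare their outputs.
    first = task()
    second = task()
    if first == second:
        # Agreement: the optional third copy is skipped, saving its cost.
        return first, 2

    # Phase 2: the copies disagree, so run the third copy and vote.
    third = task()
    if third == first:
        return first, 3
    if third == second:
        return second, 3

    # Assumption: with three distinct outputs there is no majority to trust.
    raise RuntimeError("no majority among three task copies")
```

In the fault-free case only two copies run, which is where the energy saving comes from. A permanently faulty core would force the third copy on every invocation, which is consistent with the abstract's point that this scheme alone does not address permanent faults.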