Graphics Processing Units (GPUs) are used to accelerate computation in applications across a broad variety of fields, but they remain susceptible to transient hardware faults (soft errors) that can easily compromise application output. By taking advantage of the hierarchical organization of general-purpose GPU applications into threads, warps, and cooperative thread arrays, we propose a methodology that identifies the resilience of individual threads and aims to map threads with the same resilience characteristics to the same warp. This allows partial replication mechanisms for error detection and correction to be engaged at the warp level. By exploring 12 benchmarks (17 kernels) from 4 benchmark suites, we illustrate that threads can be remapped into reliable or unreliable warps with an average overhead of only 1.63%, which then enables selective protection via replication for only those groups of threads that truly need it. Furthermore, we show that remapping threads to different warps does not sacrifice application performance. Finally, we show how this remapping facilitates warp replication for error detection and/or correction, reducing execution cycles by an average of 20.61% and 27.15% compared to standard duplication and triplication, respectively.
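The remapping idea can be illustrated with a minimal sketch. It is not the methodology evaluated here: it assumes that a per-thread resilience label is already available as input (how that label is obtained is not modeled), that a warp holds 32 threads, and that the cost of protection is counted simply as the number of warps executed. The names `remap_threads` and `replicated_warp_count` are hypothetical.

```python
from typing import Dict, List

WARP_SIZE = 32  # assumed warp width (32 threads on NVIDIA GPUs)


def remap_threads(is_resilient: List[bool],
                  warp_size: int = WARP_SIZE) -> Dict[str, List[List[int]]]:
    """Group thread IDs so each warp holds threads of a single resilience class.

    is_resilient[tid] is an assumed, precomputed label for thread `tid`.
    """
    resilient = [tid for tid, r in enumerate(is_resilient) if r]
    vulnerable = [tid for tid, r in enumerate(is_resilient) if not r]

    def chunk(ids: List[int]) -> List[List[int]]:
        # Split a list of thread IDs into consecutive warp-sized groups.
        return [ids[i:i + warp_size] for i in range(0, len(ids), warp_size)]

    return {"reliable": chunk(resilient), "unreliable": chunk(vulnerable)}


def replicated_warp_count(mapping: Dict[str, List[List[int]]],
                          copies: int = 2) -> int:
    """Warps executed when only the unreliable warps are run `copies` times."""
    return len(mapping["reliable"]) + copies * len(mapping["unreliable"])


# Toy input: 128 threads, every fourth thread labeled as vulnerable.
labels = [tid % 4 != 0 for tid in range(128)]
mapping = remap_threads(labels)
selective = replicated_warp_count(mapping, copies=2)
full_duplication = 2 * (len(labels) // WARP_SIZE)
```

In this toy input every original warp contains some vulnerable threads, so protecting them without remapping means duplicating all four warps. After remapping, the vulnerable threads fit into a single warp, and only that warp is duplicated.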