The irregular memory accesses of graph workloads lead to poor performance on modern computing platforms. On manycore reconfigurable architectures (MRAs) in particular, even state-of-the-art graph prefetchers perform poorly (only 3% speedup), because they are designed for traditional CPUs: caches in MRAs are typically not large enough to hold a large quantity of prefetched data, and many MRAs employ shared caches, which such prefetchers simply do not support. This paper studies the design of a data prefetcher for an MRA called Transmuter. The prefetcher is built on top of Prodigy, the current best-performing data prefetcher for CPUs. The key design elements that adapt the prefetcher to the MRA include fused prefetcher status handling registers and a prefetch handshake protocol to support run-time reconfiguration, in addition to a redesign of the cache structure in Transmuter. An evaluation on popular graph workloads shows that the synergistic integration of these architectures outperforms a baseline without a prefetcher by 1.27x on average and by as much as 2.72x on some workloads.