Runahead execution is a technique to mask memory latency caused by irregular memory accesses. By pre-executing the application code during occurrences of long-latency operations and prefetching anticipated cache-missed data into the cache hierarchy, runahead effectively masks memory latency for subsequent cache misses and achieves high prefetching accuracy; however, this technique has been limited to superscalar out-of-order and superscalar in-order cores. For implementation in scalar in-order cores, the challenges of area-/energy-constraint and severe cache contention remain. Here, we build the first full-stack system featuring runahead, MERE, from SoC and a dedicated ISA to the OS and programming model. Through this deployment, we show that enabling runahead in scalar in-order cores is possible, with minimal area and power overheads, while still achieving high performance. By re-constructing the sequential runahead employing a hardware/software co-design approach, the system can be implemented on a mature processor and SoC. Building on this, an adaptive runahead mechanism is proposed to mitigate the severe cache contention in scalar in-order cores. Combining this, we provide a comprehensive solution for embedded processors managing irregular workloads. Our evaluation demonstrates that the proposed MERE attains 93.5% of a 2-wide out-of-order core's performance while constraining area and power overheads below 5%, with the adaptive runahead mechanism delivering an additional 20.1% performance gain through mitigating the severe cache contention issues.
翻译:暂无翻译