Long-latency load requests continue to limit the performance of high-performance processors. To increase the latency tolerance of a processor, architects have primarily relied on two key techniques: sophisticated data prefetchers and large on-chip caches. In this work, we show that: 1) even a sophisticated state-of-the-art prefetcher can only predict half of the off-chip load requests on average across a wide range of workloads, and 2) due to the increasing size and complexity of on-chip caches, a large fraction of the latency of an off-chip load request is spent accessing the on-chip cache hierarchy. The goal of this work is to accelerate off-chip load requests by removing the on-chip cache access latency from their critical path. To this end, we propose a new technique called Hermes, whose key idea is to: 1) accurately predict which load requests might go off-chip, and 2) speculatively fetch the data required by the predicted off-chip loads directly from the main memory, while also concurrently accessing the cache hierarchy for such loads. To enable Hermes, we develop a new lightweight, perceptron-based off-chip load prediction technique that learns to identify off-chip load requests using multiple program features (e.g., sequence of program counters). For every load request, the predictor observes a set of program features to predict whether or not the load would go off-chip. If the load is predicted to go off-chip, Hermes issues a speculative request directly to the memory controller once the load's physical address is generated. If the prediction is correct, the load eventually misses the cache hierarchy and waits for the ongoing speculative request to finish, thus hiding the on-chip cache hierarchy access latency from the critical path of the off-chip load. Our evaluation shows that Hermes significantly improves performance of a state-of-the-art baseline. We open-source Hermes.
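To make the mechanism concrete, below is a minimal sketch of a perceptron-based off-chip load predictor in the spirit of the one described above. The choice of program features (load PC, and PC hashed with the cacheline offset), the table sizes, the threshold, and the saturating training rule are illustrative assumptions for this sketch, not Hermes's tuned configuration; the print statement in main() stands in for the speculative request Hermes would issue to the memory controller once the load's physical address is generated, in parallel with the cache hierarchy access.

```cpp
// Minimal sketch of a perceptron-based off-chip load predictor.
// Feature selection, table sizes, threshold, and weight bounds are
// illustrative assumptions, not the configuration used by Hermes.
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <iostream>

class OffChipPredictor {
    static constexpr int kTableSize = 1024;  // entries per feature table (assumed)
    static constexpr int kThreshold = 0;     // activation threshold (assumed)
    static constexpr int kWeightMax = 15;    // saturating weight bounds (assumed)
    static constexpr int kWeightMin = -16;

    // One weight table per program feature (here: PC, and PC XOR cacheline offset).
    std::array<std::array<int, kTableSize>, 2> tables_{};

    static std::size_t index(uint64_t feature) { return feature % kTableSize; }

public:
    // Predict whether a load will go off-chip: sum the weights selected by
    // each program feature and compare the sum against a threshold.
    bool predict(uint64_t pc, uint64_t vaddr) const {
        int sum = tables_[0][index(pc)]
                + tables_[1][index(pc ^ (vaddr & 0x3F))];
        return sum >= kThreshold;
    }

    // Train with saturating weight updates once the true outcome
    // (hit or miss in the on-chip cache hierarchy) is known.
    void train(uint64_t pc, uint64_t vaddr, bool went_off_chip) {
        const int delta = went_off_chip ? 1 : -1;
        for (int* w : { &tables_[0][index(pc)],
                        &tables_[1][index(pc ^ (vaddr & 0x3F))] }) {
            *w = std::clamp(*w + delta, kWeightMin, kWeightMax);
        }
    }
};

int main() {
    OffChipPredictor predictor;
    // Hypothetical load: if predicted off-chip, the core would issue a
    // speculative request directly to the memory controller as soon as the
    // load's physical address is generated, concurrently with the cache lookup.
    uint64_t pc = 0x400123, vaddr = 0x7fff0040;
    if (predictor.predict(pc, vaddr)) {
        std::cout << "issue speculative memory request\n";
    }
    // Later, once the load resolves in the cache hierarchy, train the predictor.
    predictor.train(pc, vaddr, /*went_off_chip=*/true);
}
```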