It is generally observed that the fraction of live lines in the shared last-level cache (SLLC) of chip multiprocessors (CMPs) is very small. This can be tackled with promotion-based replacement policies such as re-reference interval prediction (RRIP) in place of LRU, with dead-block predictors, or with reuse-based cache allocation schemes. In GPU systems, similar LLC issues are alleviated using various cache bypassing techniques. These issues worsen in heterogeneous CPU-GPU systems because the two processors differ in data access patterns and frequencies: GPUs generally work on streaming data, yet have many more threads accessing memory than CPUs. As a result, most traditional cache replacement and allocation policies prove ineffective, since the far higher number of cache accesses in GPU applications lets GPU cache lines claim most of the LLC capacity despite their minimal reuse. In this work, we implement the reuse cache approach for heterogeneous CPU-GPU systems. The reuse cache is a decoupled tag/data SLLC designed to store data only for lines that are accessed more than once. This design is based on the observation that most cache lines inserted into the LLC are never reused before being replaced. We find that the reuse cache achieves within 0.5% of the IPC gains of a statically partitioned LLC, while decreasing the area cost of the LLC by an average of 40%.
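To make the decoupled tag/data idea concrete, the following is a minimal sketch of the insertion policy a reuse cache implies: the tag array tracks more lines than the data array can hold, a first access allocates only a tag, and data space is granted on the second access, once the line has demonstrated reuse. This is an illustrative simplification (the class name, fully associative arrays, and LRU replacement are assumptions for clarity), not the paper's actual set-associative implementation.

```python
from collections import OrderedDict

class ReuseCache:
    """Sketch of a reuse cache with decoupled tag and data arrays."""

    def __init__(self, num_tags, num_data):
        # The tag array is larger than the data array, so data capacity
        # is spent only on lines that show reuse. (Sizes are assumptions.)
        self.tags = OrderedDict()   # addr -> True, ordered for LRU
        self.data = OrderedDict()   # addr -> cached block, ordered for LRU
        self.num_tags = num_tags
        self.num_data = num_data

    def access(self, addr, block):
        if addr in self.tags:
            self.tags.move_to_end(addr)
            if addr in self.data:
                # Tag + data hit: an ordinary LLC hit.
                self.data.move_to_end(addr)
                return self.data[addr]
            # Tag hit without data: this is the second access, so the
            # line has proven reuse; allocate a data entry for it now.
            if len(self.data) >= self.num_data:
                self.data.popitem(last=False)    # evict LRU data entry
            self.data[addr] = block
            return block
        # Tag miss: allocate a tag only. No data is stored on first
        # touch, filtering out lines that are dead on arrival.
        if len(self.tags) >= self.num_tags:
            victim, _ = self.tags.popitem(last=False)
            self.data.pop(victim, None)          # keep data consistent with tags
        self.tags[addr] = True
        return block                             # served from memory
```

Under this scheme, a GPU streaming workload that touches each address once allocates only tags and never displaces data, while CPU lines with genuine reuse earn data entries on their second access.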