The GPU shared L1 cache is a promising architecture, but it still suffers from high resource contention. We present a GPU shared L1 cache architecture with an aggregated tag array that minimizes L1 cache contention and takes full advantage of inter-core locality. The key idea is to decouple the tag arrays of multiple L1 caches and aggregate them so that a cache request can be compared against all tag arrays in parallel to probe for replicated data in other caches. A GPU core accesses another core's L1 cache only when replicated data exists, filtering out the unnecessary cache accesses that cause high resource contention. Experimental results show that this design improves GPU IPC by 12% on average for applications with high inter-core locality.
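To make the mechanism concrete, the following minimal C++ sketch models the aggregated tag lookup described above. It is not the paper's implementation: the core count, the set/way geometry, and names such as `AggregatedTagArray` and `probe` are illustrative assumptions. The point it shows is that a request consults every core's tag array first, so a remote L1 data array is accessed only after a confirmed tag hit; misses go straight to L2 without disturbing other cores' caches.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <optional>

// Illustrative geometry (assumed, not from the paper).
constexpr int kNumCores = 4;
constexpr int kNumSets  = 64;
constexpr int kNumWays  = 4;

struct TagEntry {
    uint64_t tag   = 0;
    bool     valid = false;
};

// One core's decoupled L1 tag array: sets x ways of tag entries.
using TagArray = std::array<std::array<TagEntry, kNumWays>, kNumSets>;

struct AggregatedTagArray {
    std::array<TagArray, kNumCores> tags;  // tag arrays of all cores, aggregated

    // Probe all cores' tag arrays for a block address.  In hardware the
    // comparisons happen in parallel across the aggregated arrays; the
    // sequential loop here only models the result of that comparison.
    // Returns the id of a core whose L1 holds a replica, if any.
    std::optional<int> probe(uint64_t block_addr) const {
        const uint64_t set = block_addr % kNumSets;
        const uint64_t tag = block_addr / kNumSets;
        for (int core = 0; core < kNumCores; ++core) {
            for (const TagEntry& e : tags[core][set]) {
                if (e.valid && e.tag == tag) return core;
            }
        }
        return std::nullopt;  // no replica anywhere: filtered, go to L2
    }
};

int main() {
    AggregatedTagArray agg;

    // Pretend core 2's L1 holds block 0x1234 (filled in by a prior access).
    const uint64_t addr = 0x1234;
    agg.tags[2][addr % kNumSets][0] = {addr / kNumSets, true};

    // Core 0 misses in its own L1 and consults the aggregated tag array.
    if (auto owner = agg.probe(addr)) {
        std::printf("replica found in core %d's L1 data array\n", *owner);
    } else {
        std::printf("no replica: forward the request to L2\n");
    }
}
```

Under these assumptions, the filtering effect falls out of the structure: a remote L1 data array is only ever read when `probe` returns its core id, so requests that would miss everywhere never contend for other cores' cache ports.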