Near-data accelerators (NDAs) integrated with main memory have the potential for significant power and performance benefits. Fully realizing these benefits requires the large available memory capacity to be shared between the host and the NDAs in a way that permits regular memory access by some applications while others are accelerated with an NDA, avoids copying data, enables collaborative processing, and simultaneously offers high performance for both host and NDA. We identify and solve new challenges in this context: mitigating row-locality interference from the host to the NDAs, reducing the read/write-turnaround overhead caused by fine-grain interleaving of host and NDA requests, architecting a memory layout that supports both the locality required by NDAs and the sophisticated address interleaving required for host performance, and supporting both packetized and traditional memory interfaces. We demonstrate our approach in a simulated system consisting of a multi-core CPU and NDA-enabled DDR4 memory modules. Using a set of microbenchmarks, we show that our mechanisms enable effective and efficient concurrent access, and we then demonstrate the system's potential on the important stochastic variance-reduced gradient (SVRG) algorithm.
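For context, the update below is the standard textbook form of SVRG, not reproduced from this paper; it is included only to illustrate why the workload is memory-intensive: each epoch requires a full-gradient pass over all examples in addition to the stochastic inner-loop updates, an access pattern that plausibly benefits from near-data acceleration.

\[
\tilde{\mu} \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(\tilde{w}),
\qquad
w_{t+1} \;=\; w_t \;-\; \eta\,\bigl(\nabla f_i(w_t) \;-\; \nabla f_i(\tilde{w}) \;+\; \tilde{\mu}\bigr),
\]

where $\tilde{w}$ is the snapshot iterate refreshed once per epoch, $\tilde{\mu}$ is its full gradient over the $n$ training examples, $f_i$ is the loss on example $i$, and $\eta$ is the step size.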