Despite generational improvements in DRAM technology, memory access latency remains the major bottleneck for application accelerators, primarily because memory interface IPs cannot fully account for variations in target applications, the algorithms used, and accelerator architectures. Since developing memory controllers for different applications is time-consuming, this paper introduces a modular and programmable memory controller that can be configured for different target applications on available hardware resources. The proposed memory controller efficiently supports cache-line accesses along with bulk memory transfers. The user can configure the controller according to the logic resources available on the FPGA, the memory access pattern, and the external memory specifications. The modular design supports various memory access optimization techniques, including request scheduling, internal caching, and direct memory access. These techniques help reduce overall latency while maintaining high sustained bandwidth. We implement the system on a state-of-the-art FPGA and evaluate its performance on workloads from two widely studied domains: graph analytics and deep learning. We show that overall memory access time improves by up to 58% on CNN and GCN workloads compared with commercial memory controller IPs.