Offloading compute-intensive kernels to hardware accelerators relies on the large degree of parallelism these platforms offer. However, the effective bandwidth of the memory interface often becomes a bottleneck, hindering the accelerator's performance. Techniques that enable data reuse, such as tiling, lower the pressure on memory traffic but still often leave the accelerator I/O-bound. A further increase in effective bandwidth is possible by using burst rather than element-wise accesses, provided the data is contiguous in memory. In this paper, we propose a memory allocation technique, and provide a proof-of-concept source-to-source compiler pass, that enables such burst transfers by modifying the data layout in external memory. We assess how this technique increases memory throughput, leaving room to exploit additional parallelism, at a minimal logic overhead.
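To make the burst-versus-element-wise contrast concrete, the following C sketch (not taken from the paper's compiler pass) compares strided, element-wise external-memory accesses with a tile that has been laid out contiguously and fetched in a single copy. The function names, the TILE size, and the assumption that an HLS-style tool maps a memcpy of a contiguous region onto an AXI burst are all illustrative.

```c
#include <string.h>

#define TILE 256

/* Element-wise access: each strided load a[i * stride] becomes a separate
 * external-memory transaction, so effective bandwidth is dominated by
 * per-transaction latency. */
void compute_elementwise(const float *a, float *out, int stride, int n) {
    for (int i = 0; i < n; i++)
        out[i] = a[i * stride] * 2.0f;   /* non-contiguous: no burst possible */
}

/* Burst access: once the layout is changed so that each tile is contiguous,
 * one memcpy (which HLS tools typically lower to a burst transfer) fetches
 * the whole tile, and the computation runs out of fast local memory.
 * For brevity, n is assumed to be a multiple of TILE. */
void compute_burst(const float *a_tiled, float *out, int n) {
    float buf[TILE];
    for (int t = 0; t < n; t += TILE) {
        memcpy(buf, a_tiled + t, TILE * sizeof(float));  /* burst read */
        for (int i = 0; i < TILE; i++)
            out[t + i] = buf[i] * 2.0f;
    }
}
```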