Image processing and machine learning applications benefit tremendously from hardware acceleration, but existing compilers target either FPGAs, which sacrifice power and performance for flexible hardware, or ASICs, which rapidly become obsolete as applications change. Programmable domain-specific accelerators have emerged as a promising middle-ground between these two extremes, but such architectures have traditionally been difficult compiler targets. The main obstacle is that these accelerators often use a different memory abstraction than CPUs and GPUs: push memories that send a data stream from one computation kernel to other kernels, possibly reordered. To address the compilation challenges caused by push memories, we propose that the representation of memory in the middle and backend of the compiler be altered to combine storage with address generation and control logic in a single structure -- a unified buffer. We show that this compiler abstraction can be implemented efficiently on a programmable accelerator, and design a memory mapping algorithm that combines polyhedral analysis and software vectorization techniques to target our accelerator. Our evaluation shows that the compiler supports programmability while maintaining high performance. It can compile a wide range of image processing and machine learning applications to our accelerator with 4.7x better runtime and 4.3x better energy-efficiency as compared to an FPGA.
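To make the abstraction concrete, the following is a minimal sketch (our own illustration, not the paper's actual IR) of how a unified buffer might be represented in a compiler: storage is bundled with per-port affine address generators and static schedules, so memory is reasoned about as "data in, possibly reordered streams out" rather than as a flat address space. All names and fields here are assumptions for exposition.

```cpp
// Hypothetical compiler-side representation of a unified buffer.
#include <cstdint>
#include <vector>

// Affine address generator: addr = offset + sum_i(stride[i] * iter[i]),
// iterating over a rectangular domain given by `extent`.
struct AddressGen {
  int64_t offset;
  std::vector<int64_t> stride;
  std::vector<int64_t> extent;   // loop trip counts, innermost first
};

// Control logic: when each port starts and how often it fires.
struct Schedule {
  int64_t start_cycle;
  std::vector<int64_t> delay;    // per-dimension cycles between accesses
};

// A unified buffer: one write port filling the storage, several read ports
// streaming (possibly reordered) data to downstream compute kernels.
struct UnifiedBuffer {
  int64_t capacity_words;                 // physical storage size
  AddressGen write_addr;
  Schedule   write_sched;
  std::vector<AddressGen> read_addr;      // one per consumer stream
  std::vector<Schedule>   read_sched;
};
```

Under this sketch, mapping a program onto the accelerator amounts to deriving the stride/extent vectors from polyhedral analysis of each kernel's access pattern and choosing schedules that fit the physical memory's ports and capacity.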