Coarse-grained reconfigurable arrays (CGRAs) have emerged as promising programmable accelerator architectures, but applications running on them must be pipelined to achieve high maximum clock frequencies. Current CGRA compilers either lack pipelining techniques, resulting in low performance, or pipeline exhaustively, resulting in high energy and resource consumption. We introduce Cascade, an application pipelining toolkit for CGRAs that includes a CGRA application frequency model, automated pipelining techniques for CGRA application compilers that work with both dense and sparse applications, and hardware optimizations for improving application frequency. Compared to a compiler without pipelining, Cascade enables 7-34x lower critical path delays and 7-190x lower energy-delay product (EDP) across a variety of dense image processing and machine learning workloads, and 2-4.4x lower critical path delays and 1.5-4.2x lower EDP on sparse workloads.