The deployment of neural networks on heterogeneous SoCs coupled with custom accelerators is challenging because of the lack of end-to-end software tools for these systems. Moreover, the low-level schedules and mapping strategies that accelerator developers provide for typical tensor operations are not necessarily optimal for every use case. Frameworks that automatically benchmark the generated code on a specific hardware configuration are therefore of special interest. In this work, the integration between the code generation framework TVM and the systolic-array-based accelerator Gemmini is presented. A generic schedule to offload the GEneral Matrix Multiply (GEMM) tensor operation onto Gemmini is detailed, and its suitability is evaluated by running the AutoTVM tuning process on it. Our generated code achieves a peak throughput of 46 giga-operations per second (GOPs) under a 100 MHz clock on a Xilinx ZCU102 FPGA, outperforming previous work. Furthermore, the code generated by this integration surpasses the default hand-tuned schedules provided by the Gemmini developers on real-world workloads.
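For context, the GEMM computation that such a schedule offloads can be sketched as a blocked (tiled) matrix multiply. The sketch below is a minimal, hypothetical pure-Python illustration of the tiling idea, not the actual TVM schedule or the Gemmini API; the tile size and function name are assumptions chosen for clarity:

```python
def gemm_tiled(A, B, tile=2):
    """C = A @ B computed block-by-block, mirroring how a systolic
    array such as Gemmini consumes fixed-size tiles of the operands.
    A is n x k, B is k x m; plain lists of lists are used for clarity."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0] * m for _ in range(n)]
    # Outer loops walk over tiles; inner loops accumulate within a tile.
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for kk in range(k0, min(k0 + tile, k)):
                            C[i][j] += A[i][kk] * B[kk][j]
    return C
```

In an auto-tuning flow such as AutoTVM, parameters like the tile size above form the search space that the tuner explores against measured performance on the target hardware.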