In this paper, we present FADES (Fused Architecture for DEnse and Sparse matrices), a dynamically reconfigurable hardware accelerator. The FADES design offers multiple configuration options that trade off parallelism and complexity, using a dataflow model to create four stages that read, compute, scale, and write results. FADES is mapped to the programmable logic (PL) and integrated with the TensorFlow Lite inference engine running on the processing system (PS) of a heterogeneous SoC device. The accelerator computes the tensor operations, while dynamic reconfiguration switches precision between int8 and float modes. Compared with supporting both arithmetic precisions simultaneously, this approach improves performance, because more cores can be mapped to the resource-constrained device, and lowers power consumption. Compared with a high-performance systolic architecture for dense matrices, FADES obtains 25% better performance in dense mode with half the DSP blocks in the same technology. In sparse mode, we show that the core can outperform dense mode even at low sparsity levels, and that a single core achieves up to 20x acceleration over the software-optimized NEON RUY library.
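To make the four-stage dataflow concrete, the following is a minimal software sketch of a read, compute, scale, and write pipeline in int8 mode. It is an illustration under stated assumptions, not the FADES hardware design: the function names, the `(column, weight)` stream format, and the single floating-point `multiplier` used for requantization are all hypothetical choices made for this example. The sparse reader simply skips zero weights, which is one plausible way a sparse mode could reduce the number of multiply-accumulate operations.

```python
# Hypothetical software model of a four-stage dataflow (read, compute,
# scale, write). Names and data formats are assumptions for illustration;
# they are not taken from the FADES implementation.

def read_dense(weights):
    # Read stage, dense mode: stream every weight with its column index.
    for row in weights:
        yield [(j, w) for j, w in enumerate(row)]

def read_sparse(weights):
    # Read stage, sparse mode: stream only the non-zero weights.
    for row in weights:
        yield [(j, w) for j, w in enumerate(row) if w != 0]

def compute(rows, activations):
    # Compute stage: multiply-accumulate into a wide integer accumulator.
    for row in rows:
        yield sum(w * activations[j] for j, w in row)

def scale(accumulators, multiplier, zero_point=0):
    # Scale stage: requantize each accumulator and saturate to int8.
    # Python's round() uses round-half-to-even; real hardware may differ.
    for acc in accumulators:
        q = int(round(acc * multiplier)) + zero_point
        yield max(-128, min(127, q))

def write(values):
    # Write stage: collect the results into the output buffer.
    return list(values)

def run(weights, activations, multiplier, sparse=False):
    # Chain the four stages as generators so each row flows through in turn.
    reader = read_sparse if sparse else read_dense
    return write(scale(compute(reader(weights), activations), multiplier))

def mac_count(weights, sparse=False):
    # Number of multiply-accumulates issued by the chosen read stage.
    reader = read_sparse if sparse else read_dense
    return sum(len(row) for row in reader(weights))
```

In this sketch, `run(weights, activations, multiplier, sparse=True)` returns the same output as the dense path, while `mac_count` shows how many multiply-accumulates the sparse reader avoids. The stages are written as chained generators to mirror a streaming pipeline; an actual accelerator would run the stages concurrently rather than lazily in sequence.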