This paper presents an optimized methodology to design and deploy Speech Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) on a state-of-the-art MicroController Unit (MCU) with 1+8 general-purpose RISC-V cores. To achieve low-latency execution, we propose an optimized software pipeline that interleaves the parallel computation of LSTM or GRU recurrent blocks, featuring vectorized 8-bit integer (INT8) and 16-bit floating-point (FP16) compute units, with manually managed memory transfers of the model parameters. To ensure minimal accuracy degradation with respect to the full-precision models, we propose a novel FP16-INT8 Mixed-Precision Post-Training Quantization (PTQ) scheme that compresses the recurrent layers to 8 bits while keeping the remaining layers in FP16. Experiments are conducted on multiple LSTM- and GRU-based SE models trained on the Valentini dataset, with up to 1.24M parameters. Thanks to the proposed approaches, we speed up the computation by up to 4x with respect to the lossless FP16 baselines. Unlike a uniform 8-bit quantization, which degrades the PESQ score by 0.3 on average, the Mixed-Precision PTQ scheme limits the degradation to only 0.06 while achieving a 1.4-1.7x memory saving. Thanks to this compression, we cut the power cost of the external memory by fitting the large models onto the limited on-chip non-volatile memory, and we obtain an MCU power saving of up to 2.5x by reducing the supply voltage from 0.8V to 0.65V while still meeting the real-time constraints. Our design is 10x more energy efficient than state-of-the-art SE solutions deployed on single-core MCUs, which rely on smaller models and quantization-aware training.
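To illustrate the compute/transfer interleaving idea sketched above, the following is a minimal, self-contained C sketch of double-buffered weight streaming: while the cores process one tile of recurrent-layer parameters, the next tile is fetched into the other half of a ping-pong buffer. Here dma_start/dma_wait are synchronous stand-ins (a plain memcpy) for the MCU's asynchronous DMA, and all tile sizes and names are illustrative assumptions, not the paper's actual runtime API.

    /* Double-buffered parameter streaming: overlap compute on tile i with
     * the transfer of tile i+1. Placeholders, not the real DMA interface. */
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TILE_BYTES 4096
    #define NUM_TILES  8

    static int8_t l2_weights[NUM_TILES][TILE_BYTES]; /* parameters outside L1 */
    static int8_t l1_buf[2][TILE_BYTES];             /* L1 ping-pong buffer  */

    /* Stand-in for an asynchronous DMA transfer (synchronous copy here). */
    static void dma_start(int8_t *dst, const int8_t *src) { memcpy(dst, src, TILE_BYTES); }
    /* Stand-in for blocking on the in-flight DMA transfer. */
    static void dma_wait(void) { }

    static void compute_tile(const int8_t *w) {
        /* placeholder for the parallel INT8 LSTM/GRU kernel on this tile */
        (void)w;
    }

    int main(void) {
        dma_start(l1_buf[0], l2_weights[0]);          /* prefetch first tile */
        for (int i = 0; i < NUM_TILES; i++) {
            dma_wait();                               /* tile i is now in L1 */
            if (i + 1 < NUM_TILES)                    /* overlap next fetch  */
                dma_start(l1_buf[(i + 1) % 2], l2_weights[i + 1]);
            compute_tile(l1_buf[i % 2]);              /* compute while loading */
        }
        puts("processed all tiles with compute/transfer overlap");
        return 0;
    }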
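The mixed-precision PTQ scheme can likewise be sketched in a few lines. The snippet below assumes a symmetric per-tensor mapping w ~ scale * q with q in [-127, 127] for the recurrent-layer weights, while the scale (and the remaining layers) would stay in higher precision; the symmetric scheme and the name quantize_int8 are illustrative assumptions, not the paper's implementation, and plain float is used here as a portable stand-in for FP16.

    /* Symmetric per-tensor INT8 PTQ of one recurrent weight matrix, plus a
     * check of the worst-case round-trip error of the mapping. */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Quantize n floats to INT8; returns the dequantization scale. */
    static float quantize_int8(const float *w, int8_t *q, int n) {
        float amax = 0.0f;
        for (int i = 0; i < n; i++) amax = fmaxf(amax, fabsf(w[i]));
        float scale = amax / 127.0f;
        for (int i = 0; i < n; i++) {
            int v = (int)lroundf(w[i] / scale);
            q[i] = (int8_t)(v < -127 ? -127 : (v > 127 ? 127 : v));
        }
        return scale; /* kept in higher precision (FP16 on the target) */
    }

    int main(void) {
        enum { N = 256 * 256 };               /* one recurrent weight matrix */
        float *w = malloc(N * sizeof *w);
        int8_t *q = malloc(N);
        for (int i = 0; i < N; i++) w[i] = (float)rand() / RAND_MAX - 0.5f;

        float scale = quantize_int8(w, q, N); /* recurrent layer -> INT8 */

        float err = 0.0f;                     /* max |w - scale*q| */
        for (int i = 0; i < N; i++) err = fmaxf(err, fabsf(w[i] - scale * q[i]));
        printf("scale=%g, max round-trip error=%g\n", scale, err);
        free(w); free(q);
        return 0;
    }

Because no retraining is involved, such a PTQ pass only rescales and rounds the stored weights, which is what allows the recurrent layers to shrink by roughly 2x (FP16 to INT8) while the FP16 layers preserve the baseline accuracy.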