The energy efficiency of analog computing-in-memory (ACIM) accelerators for recurrent neural networks, particularly long short-term memory (LSTM) networks, is limited by the high proportion of nonlinear (NL) operations, which are typically executed digitally. To address this, we propose an LSTM accelerator incorporating an ACIM macro with a reconfigurable (1-5 bit) nonlinear in-memory (NLIM) analog-to-digital converter (ADC) that computes NL activations directly in the analog domain. The design uses: 1) a dual 9T bitcell with decoupled read/write paths for signed-input and ternary-weight operations; 2) a read-word-line underdrive Cascode (RUDC) technique that achieves 2.8X higher read-bitline dynamic range than single-transistor designs (1.4X higher than a conventional cascode structure, with 7X lower current variation); 3) a dual-supply 6T-SRAM array that supports efficient multi-bit weight operations and reduces both bitcell count (7.8X) and latency (4X) for 5-bit weight operations. We experimentally demonstrate a 5-bit NLIM ADC that approximates the NL activations in LSTM cells with an average error below 1 LSB. Simulation confirms that the NLIM ADC is robust against temperature variations, thanks to its replica bias strategy. Our design achieves 92.0% on-chip inference accuracy on a 12-class keyword-spotting task while demonstrating 2.2X higher system-level normalized energy efficiency and 1.6X higher normalized area efficiency than state-of-the-art works. The results combine physical measurements of a macro unit, which accounts for the majority of LSTM operations (99% linear and 80% nonlinear operations), with simulations of the remaining components, including additional LSTM and fully connected layers.
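The idea of folding an activation function into the ADC can be illustrated with a purely behavioral model. The sketch below is an assumption-laden illustration, not the circuit described above: it assumes an ideal converter whose reference thresholds are spaced non-uniformly so that the output code tracks tanh of the input, and it ignores noise, mismatch, and temperature effects. All function names (`nl_adc_thresholds`, `nl_adc_convert`, `code_to_value`, `mean_error_lsb`) are hypothetical.

```python
import math
from bisect import bisect_right


def nl_adc_thresholds(bits):
    """Input-domain thresholds of an idealized tanh-shaped ADC (assumed model)."""
    levels = 2 ** bits
    lsb = 2.0 / levels  # tanh output spans (-1, 1), split into 2**bits codes
    # Boundaries between adjacent output codes, mapped back through atanh.
    return [math.atanh(-1.0 + k * lsb) for k in range(1, levels)]


def nl_adc_convert(x, thresholds):
    """Return the output code: the count of thresholds at or below x."""
    return bisect_right(thresholds, x)


def code_to_value(code, bits):
    """Map an output code to the mid-point of its output interval."""
    lsb = 2.0 / (2 ** bits)
    return -1.0 + (code + 0.5) * lsb


def mean_error_lsb(bits, n_points=2001, x_max=4.0):
    """Average |tanh(x) - quantized value| in LSBs over a uniform input sweep."""
    thresholds = nl_adc_thresholds(bits)
    lsb = 2.0 / (2 ** bits)
    total = 0.0
    for i in range(n_points):
        x = -x_max + 2.0 * x_max * i / (n_points - 1)
        y = code_to_value(nl_adc_convert(x, thresholds), bits)
        total += abs(math.tanh(x) - y) / lsb
    return total / n_points


if __name__ == "__main__":
    # Sweep the reconfigurable resolution range mentioned in the abstract.
    for bits in range(1, 6):
        print(bits, "bit:", round(mean_error_lsb(bits), 3), "LSB average error")
```

In this idealized model the quantization error is bounded by about half an LSB by construction, so it says nothing about the measured sub-1-LSB figure. It only shows why non-uniform thresholds let a single conversion produce the activation value directly, removing the separate digital NL step.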