Tensor processing units (TPUs), specialized hardware accelerators for machine learning tasks, have shown significant performance improvements when executing convolutional layers in convolutional neural networks (CNNs). However, they struggle to maintain the same efficiency in fully connected (FC) layers, leading to suboptimal hardware utilization. In-memory analog computing (IMAC) architectures, on the other hand, have demonstrated notable speedups when executing FC layers. This paper introduces a novel heterogeneous, mixed-signal, and mixed-precision architecture that integrates an IMAC unit with an edge TPU to enhance mobile CNN performance. To leverage the strengths of TPUs for convolutional layers and IMAC circuits for dense layers, we propose a unified learning algorithm that incorporates mixed-precision training techniques to mitigate potential accuracy drops when deploying models on the TPU-IMAC architecture. Simulations demonstrate that the TPU-IMAC configuration achieves up to $2.59\times$ performance improvement and $88\%$ memory reduction compared to conventional TPU architectures for various CNN models, while maintaining comparable accuracy. The TPU-IMAC architecture shows promise for applications where energy efficiency and high performance are essential, such as edge computing and real-time processing in mobile devices. The unified training algorithm and the integration of IMAC and TPU architectures contribute to the potential impact of this research on the broader machine learning landscape.
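The mixed-precision split at the heart of this design pairs higher-precision digital arithmetic on the TPU path with very low-precision weights suited to analog IMAC crossbars. The sketch below contrasts the two regimes on a single FC layer; the specific quantization schemes (symmetric int8 and ternary weights) are illustrative assumptions for this example, not the paper's actual training algorithm:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8 quantization, representative of digital
    # TPU-style inference (illustrative assumption).
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_ternary(w):
    # Ternary {-1, 0, +1} weights, a common choice for analog in-memory
    # crossbars (illustrative assumption; not the paper's exact scheme).
    delta = 0.7 * np.abs(w).mean()
    q = np.where(w > delta, 1, np.where(w < -delta, -1, 0)).astype(np.int8)
    scale = np.abs(w[q != 0]).mean() if np.any(q != 0) else 1.0
    return q, scale

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 64)).astype(np.float32)
w_fc = rng.standard_normal((64, 10)).astype(np.float32)

# Same FC layer evaluated under both precision regimes.
q8, s8 = quantize_int8(w_fc)
y_int8 = x @ (q8.astype(np.float32) * s8)

qt, st = quantize_ternary(w_fc)
y_ternary = x @ (qt.astype(np.float32) * st)

# Relative output error versus the float32 reference.
y_ref = x @ w_fc
err8 = np.linalg.norm(y_ref - y_int8) / np.linalg.norm(y_ref)
errt = np.linalg.norm(y_ref - y_ternary) / np.linalg.norm(y_ref)
print(f"relative error  int8: {err8:.4f}  ternary: {errt:.4f}")
```

The ternary path incurs a larger per-layer error than int8, which is why a unified training scheme that anticipates the low-precision analog layers during training is needed to recover end-to-end accuracy.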