We introduce an automated tool for deploying ultra-low-latency, low-power deep neural networks with convolutional layers on FPGAs. By extending the hls4ml library, we demonstrate an inference latency of $5\,\mu$s using convolutional architectures, targeting microsecond-latency applications such as those at the CERN Large Hadron Collider. Considering benchmark models trained on the Street View House Numbers dataset, we demonstrate various methods for model compression that fit the computational constraints of a typical FPGA device used in the trigger and data acquisition systems of particle detectors. In particular, we discuss pruning and quantization-aware training, and demonstrate how resource utilization can be significantly reduced with little to no loss in model accuracy. We show that the critical FPGA resource consumption can be reduced by 97% with zero loss in model accuracy, and by 99% when tolerating a 6% accuracy degradation.
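To make the two compression techniques named above concrete, the sketch below illustrates magnitude-based pruning (zeroing the smallest-magnitude fraction of weights) and fixed-point quantization (snapping weights onto an `ap_fixed`-style signed grid, as used in HLS). This is a minimal NumPy illustration of the underlying ideas, not the actual hls4ml/QKeras workflow; the function names and parameter choices are our own.

```python
import numpy as np

def prune_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude `sparsity` fraction of weights.

    Illustrative stand-in for magnitude-based pruning; real workflows
    prune gradually during training rather than in one post-hoc step.
    """
    k = int(np.floor(sparsity * weights.size))
    if k == 0:
        return weights.copy()
    # Threshold at the k-th smallest absolute value (ties may over-prune).
    thresh = np.sort(np.abs(weights), axis=None)[k - 1]
    return np.where(np.abs(weights) <= thresh, 0.0, weights)

def quantize_fixed(weights, total_bits=8, int_bits=1):
    """Round weights onto a signed fixed-point grid.

    Mimics an ap_fixed<total_bits, int_bits> representation:
    `int_bits` covers the sign and integer part, the rest is fractional.
    """
    frac_bits = total_bits - int_bits
    scale = 2.0 ** frac_bits
    lo = -(2.0 ** (int_bits - 1))           # most negative representable value
    hi = 2.0 ** (int_bits - 1) - 1.0 / scale  # most positive representable value
    return np.clip(np.round(weights * scale) / scale, lo, hi)
```

In practice the paper's toolflow applies these ideas during training (pruning schedules and quantization-aware training via QKeras) so the network learns to compensate for the reduced precision, which is what allows the large resource savings at little accuracy cost.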