Accelerators that implement Deep Neural Networks (DNNs) for image-based object detection must fetch large volumes of data, both images and neural network parameters, especially when they process video streams, which leads to high power dissipation and bandwidth requirements. While some solutions exist to mitigate the power and bandwidth demands of data fetching, they are often evaluated at a scale much smaller than that of the target application, which makes it hard to find the best tradeoff in practice. This paper sets up the infrastructure to assess at scale a key power and bandwidth optimization, weight clustering, for You Only Look Once v3 (YOLOv3), a neural network-based object detection system, using videos of real driving conditions. Our assessment shows that accelerators such as systolic arrays with an Output Stationary architecture are highly effective when combined with weight clustering. In particular, applying weight clustering independently per neural network layer, with between 32 (5-bit) and 256 (8-bit) weight values, achieves an accuracy close to that of the original YOLOv3 weights (32-bit weights). This bit-count reduction of the weights cuts bandwidth requirements down to 30%-40% of the original requirements and reduces energy consumption down to 45%. This is because (i) the energy spent on multiply-and-accumulate operations is much smaller than that spent on DRAM data fetching, and (ii) an appropriately designed accelerator can ensure that most of the data fetched corresponds to neural network weights, to which clustering can be applied. Overall, our at-scale assessment provides key results for architecting camera-based object detection accelerators by bringing together a real-life application (YOLOv3) and real driving videos in a unified setup, so that the trends observed are reliable.
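To make the idea of per-layer weight clustering concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes plain one-dimensional k-means (Lloyd's algorithm) as the clustering method, which the abstract does not specify, and the names `cluster_layer_weights` and `weight_storage_ratio` are hypothetical. Each layer is replaced by a small codebook of 32-bit centroids plus one short index per weight, and the second function estimates the resulting weight-storage ratio for that layer only (images and activations are not counted).

```python
import numpy as np


def cluster_layer_weights(weights, n_clusters, n_iters=20, seed=0):
    """Cluster one layer's weights with 1-D k-means (assumed method).

    Returns (codebook, indices): `codebook` holds n_clusters float32
    centroids, `indices` holds one centroid id per original weight.
    """
    flat = np.asarray(weights, dtype=np.float32).ravel()
    rng = np.random.default_rng(seed)
    # Assumption: initialise centroids from random samples of this layer.
    codebook = rng.choice(flat, size=n_clusters, replace=False)
    for _ in range(n_iters):
        # Assign every weight to its nearest centroid.
        indices = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its members; leave empty ones as-is.
        for k in range(n_clusters):
            members = flat[indices == k]
            if members.size:
                codebook[k] = members.mean()
    # Final assignment against the updated centroids.
    indices = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, indices.reshape(np.shape(weights))


def weight_storage_ratio(n_weights, n_clusters, full_bits=32):
    """Bits needed for indices plus codebook, relative to full-precision weights."""
    index_bits = max(1, (n_clusters - 1).bit_length())  # 32 -> 5 bits, 256 -> 8 bits
    clustered_bits = n_weights * index_bits + n_clusters * full_bits
    return clustered_bits / (n_weights * full_bits)


if __name__ == "__main__":
    # Synthetic stand-in for one layer; real YOLOv3 weights are not used here.
    layer = np.random.default_rng(1).normal(size=(64, 64)).astype(np.float32)
    codebook, indices = cluster_layer_weights(layer, n_clusters=32)
    reconstructed = codebook[indices]  # weights as the accelerator would see them
    print("mean abs error:", float(np.abs(reconstructed - layer).mean()))
    print("storage ratio:", weight_storage_ratio(layer.size, 32))
```

Because the clustering runs independently per layer, each layer gets its own codebook, and the codebook overhead becomes negligible for layers with many weights. The ratio computed here covers weight storage alone, so it is not directly comparable to the overall 30%-40% bandwidth figure reported above, which also includes other fetched data.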