During the deployment of deep neural networks (DNNs) on edge devices, many research efforts are devoted to coping with limited hardware resources. However, little attention is paid to the influence of dynamic power management. As edge devices typically operate on a limited battery energy budget (rather than the nearly unlimited power supply of servers or workstations), their dynamic power management frequently changes the execution frequency, as in the widely used dynamic voltage and frequency scaling (DVFS) technique. This leads to highly unstable inference speed, especially for computation-intensive DNN models, which can harm the user experience and waste hardware resources. We first identify this problem and then propose All-in-One, a highly representative pruning framework designed to work with DVFS-based dynamic power management. The framework uses only one set of model weights and soft masks (together with other auxiliary parameters of negligible storage) to represent multiple models of various pruning ratios. By reconfiguring the model to the pruning ratio corresponding to a specific execution frequency (and voltage), we achieve stable inference speed, i.e., we keep the difference in speed across various execution frequencies as small as possible. Our experiments demonstrate that our method not only achieves high accuracy for multiple models of different pruning ratios, but also reduces the variance of their inference latency across frequencies, with minimal memory consumption of only one model and one soft mask.
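To make the core mechanism concrete, the following is a minimal sketch (not the paper's actual implementation) of how a single weight tensor plus a single soft mask could be re-thresholded into different binary masks, one per DVFS frequency level. The mapping `freq_to_ratio` and the helpers `binary_mask` and `masked_linear` are hypothetical names introduced only for illustration; the pruning-ratio values are made up.

```python
# Hedged sketch: one shared weight tensor and one soft mask (importance
# scores) represent all pruned variants; a binary mask is derived on the
# fly for the pruning ratio assigned to the current execution frequency.
import torch

# Hypothetical mapping from execution frequency (MHz) to pruning ratio:
# the lower the frequency, the more we prune, so latency stays stable.
freq_to_ratio = {1800: 0.0, 1200: 0.3, 600: 0.6}

def binary_mask(soft_mask: torch.Tensor, ratio: float) -> torch.Tensor:
    """Keep the top (1 - ratio) fraction of weights by soft-mask score."""
    k = int(soft_mask.numel() * (1.0 - ratio))  # number of weights to keep
    # Threshold at the k-th largest score (= (numel - k + 1)-th smallest).
    threshold = soft_mask.flatten().kthvalue(soft_mask.numel() - k + 1).values
    return (soft_mask >= threshold).float()

def masked_linear(x, weight, soft_mask, freq_mhz):
    """Run one linear layer at the pruning ratio chosen for freq_mhz."""
    mask = binary_mask(soft_mask, freq_to_ratio[freq_mhz])
    return x @ (weight * mask).t()

# Only one weight tensor and one soft mask are stored for all variants.
weight = torch.randn(64, 128)
soft_mask = torch.rand(64, 128)
x = torch.randn(1, 128)
for f in freq_to_ratio:
    y = masked_linear(x, weight, soft_mask, f)
    pruned = (binary_mask(soft_mask, freq_to_ratio[f]) == 0).float().mean()
    print(f"{f} MHz -> pruned fraction {pruned.item():.2f}")
```

Under this reading, switching pruning ratios requires no extra model copies: only the frequency-to-ratio lookup and the thresholding step change, which matches the abstract's claim of one-model, one-soft-mask memory cost.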