The rise of mobile AI accelerators allows latency-sensitive applications to execute lightweight Deep Neural Networks (DNNs) on the client side. However, critical applications demand powerful models that edge devices cannot host, forcing them to offload inference requests whose high-dimensional data must compete for limited bandwidth. This work proposes shifting away from executing the shallow layers of partitioned DNNs locally. Instead, it advocates concentrating local resources on variational compression optimized for machine interpretability. We introduce a novel framework for resource-conscious compression models and extensively evaluate our method in an environment reflecting the asymmetric resource distribution between edge devices and servers. Our method achieves a 60\% lower bitrate than a state-of-the-art split computing (SC) method without decreasing accuracy and is up to 16x faster than offloading with existing codec standards.