Real-time video inference on edge devices such as mobile phones and drones is challenging due to the high computation cost of Deep Neural Networks. We present Adaptive Model Streaming (AMS), a new approach to improving the performance of efficient lightweight models for video inference on edge devices. AMS uses a remote server to continually train and adapt a small model running on the edge device, boosting its accuracy on the live video via online knowledge distillation from a large, state-of-the-art model. We discuss the challenges of over-the-network model adaptation for video inference and present several techniques to reduce the communication cost of this approach: avoiding excessive overfitting, updating only a small fraction of important model parameters, and adaptive sampling of training frames at the edge device. On the task of video semantic segmentation, our experimental results show a 0.4--17.8 percent improvement in mean Intersection-over-Union over a pre-trained model across several video datasets. Our prototype performs video segmentation at 30 frames per second with 40 milliseconds camera-to-label latency on a Samsung Galaxy S10+ mobile phone, using less than 300 Kbps of uplink and downlink bandwidth on the device.
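To make the server-side adaptation loop concrete, below is a minimal Python/PyTorch sketch (not the authors' implementation) of the online knowledge distillation and sparse-update idea summarized in the abstract: a large teacher model pseudo-labels frames sampled from the edge device's video, the lightweight student is fine-tuned for a few steps, and only the most-changed parameters are streamed back to the device. All names (`adapt_student`, `top_k_frac`, etc.) and the particular loss and selection heuristic are illustrative assumptions.

```python
# Hypothetical sketch of one AMS-style server-side adaptation round.
import copy
import torch
import torch.nn.functional as F

def adapt_student(student, teacher, sampled_frames, lr=1e-4, steps=5, top_k_frac=0.05):
    """Distill teacher labels into the student on recently sampled frames,
    then return a sparse parameter update to send back to the edge device."""
    reference = copy.deepcopy(student.state_dict())   # weights the edge device currently holds
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)

    teacher.eval()
    student.train()
    for _ in range(steps):                            # few steps, to avoid overfitting recent frames
        for frames in sampled_frames:                 # mini-batches of frames uploaded by the device
            with torch.no_grad():
                soft_labels = teacher(frames)         # teacher's per-pixel class scores
            logits = student(frames)
            # Distillation loss: cross-entropy against the teacher's hard labels
            loss = F.cross_entropy(logits, soft_labels.argmax(dim=1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Keep only the fraction of parameters that changed the most,
    # reducing the downlink bandwidth needed to update the device.
    update = {}
    for name, new_param in student.state_dict().items():
        delta = new_param - reference[name]
        k = max(1, int(top_k_frac * delta.numel()))
        _, idx = delta.abs().flatten().topk(k)
        update[name] = (idx, new_param.flatten()[idx])  # indices + new values to patch on-device
    return update
```

The edge device would apply the returned indices and values to its local copy of the model, which is one simple way to realize the "update a small fraction of important model parameters" technique mentioned above.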