Video analytics pipelines have steadily shifted to edge deployments to reduce bandwidth overheads and privacy violations, but in doing so, face an ever-growing resource tension. Most notably, edge-box GPUs lack the memory needed to concurrently house the growing number of (increasingly complex) models for real-time inference. Unfortunately, existing solutions that rely on time/space sharing of GPU resources are insufficient as the required swapping delays result in unacceptable frame drops and accuracy violations. We present model merging, a new memory management technique that exploits architectural similarities between edge vision models by judiciously sharing their layers (including weights) to reduce workload memory costs and swapping delays. Our system, GEMEL, efficiently integrates merging into existing pipelines by (1) leveraging several guiding observations about per-model memory usage and inter-layer dependencies to quickly identify fruitful and accuracy-preserving merging configurations, and (2) altering edge inference schedules to maximize merging benefits. Experiments across diverse workloads reveal that GEMEL reduces memory usage by up to 60.7%, and improves overall accuracy by 8-39% relative to time/space sharing alone.
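To make the layer-sharing idea concrete, below is a minimal PyTorch sketch, not GEMEL's actual implementation, in which two toy edge models reference the same stem layers so that the shared weights occupy GPU memory only once. The module structure, shapes, and class counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy illustration of merging: two models share their early convolutional
# layers (including weights), so only one copy of those weights is resident
# in GPU memory. Layer choices here are hypothetical, not GEMEL's configs.

shared_stem = nn.Sequential(              # layers shared across both models
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(),
)

class EdgeModel(nn.Module):
    def __init__(self, stem: nn.Module, num_classes: int):
        super().__init__()
        self.stem = stem                  # reference, not a copy: weights are shared
        self.head = nn.Sequential(        # per-model head keeps task-specific weights
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.head(self.stem(x))

model_a = EdgeModel(shared_stem, num_classes=10)   # e.g., a vehicle classifier
model_b = EdgeModel(shared_stem, num_classes=5)    # e.g., a pedestrian classifier

# Both models point at the same stem parameters, so the merged workload
# stores the stem weights once rather than twice.
assert model_a.stem[0].weight.data_ptr() == model_b.stem[0].weight.data_ptr()

with torch.no_grad():
    frame = torch.randn(1, 3, 224, 224)
    out_a, out_b = model_a(frame), model_b(frame)
```

In this sketch the memory savings come purely from object sharing; GEMEL's contribution is deciding which layers can be shared across heterogeneous models without hurting accuracy, and scheduling inference to exploit that sharing.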