An increasing number of applications rely on complex inference tasks that are based on machine learning (ML). Currently, there are two options to run such tasks: either they are served directly by the end device (e.g., smartphones, IoT equipment, smart vehicles), or they are offloaded to a remote cloud. Both options may be unsatisfactory for many applications: local models may have inadequate accuracy, while the cloud may fail to meet delay constraints. In this paper, we present the novel idea of \emph{inference delivery networks} (IDNs), networks of computing nodes that coordinate to satisfy ML inference requests while achieving the best trade-off between latency and accuracy. IDNs bridge the dichotomy between device and cloud execution by integrating inference delivery at the various tiers of the infrastructure continuum (access, edge, regional data center, cloud). We propose a distributed dynamic policy for ML model allocation in an IDN, whereby each node dynamically updates its local set of inference models based on the requests observed in the recent past and on limited information exchanged with its neighboring nodes. Our policy offers strong performance guarantees in an adversarial setting and shows improvements over greedy heuristics of similar complexity in realistic scenarios.
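To give a concrete picture of what such a per-node update could look like, the sketch below shows one possible shape of the allocation loop: a node tracks recently observed requests in a sliding window, merges them with aggregate counts received from neighbors, and greedily re-selects the models it stores under a capacity budget. This is only an illustrative sketch under assumed names and an assumed greedy selection rule (Node, capacity, window_size, neighbor_counts, and the model catalog are all hypothetical); it is not the paper's actual policy, whose update rule and adversarial guarantees are not detailed in this abstract.

\begin{verbatim}
# Illustrative sketch only: NOT the paper's policy. All names and the greedy
# utility-density selection rule below are hypothetical assumptions.
from collections import Counter, deque


class Node:
    def __init__(self, capacity, window_size, catalog):
        self.capacity = capacity                  # local storage budget for models
        self.window = deque(maxlen=window_size)   # sliding window of recent requests
        self.catalog = catalog                    # {model_id: (size, utility_per_request)}
        self.models = set()                       # models currently stored locally

    def observe(self, model_id):
        """Record a locally observed inference request."""
        self.window.append(model_id)

    def update_allocation(self, neighbor_counts):
        """Re-select the local model set from recent local demand plus
        aggregate request counts shared by neighboring nodes."""
        demand = Counter(self.window)
        for model_id, count in neighbor_counts.items():
            demand[model_id] += 0.5 * count       # discount remote demand (arbitrary weight)

        # Greedy knapsack-style selection by utility density (illustrative only).
        ranked = sorted(
            demand,
            key=lambda m: demand[m] * self.catalog[m][1] / self.catalog[m][0],
            reverse=True,
        )
        chosen, used = set(), 0
        for m in ranked:
            size = self.catalog[m][0]
            if used + size <= self.capacity:
                chosen.add(m)
                used += size
        self.models = chosen
        return self.models


# Toy usage: one node, three candidate models, a few observed requests.
catalog = {"resnet50": (4, 0.9), "mobilenet": (1, 0.6), "bert-base": (6, 0.95)}
node = Node(capacity=5, window_size=100, catalog=catalog)
for r in ["resnet50", "mobilenet", "mobilenet", "resnet50", "resnet50"]:
    node.observe(r)
print(node.update_allocation(neighbor_counts={"bert-base": 3}))
\end{verbatim}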