Two major techniques are commonly used to meet real-time inference constraints when distributing models across resource-constrained IoT devices: (1) model parallelism (MP) and (2) class parallelism (CP). In MP, transmitting bulky intermediate data (orders of magnitude larger than the input) between devices imposes huge communication overhead. Although CP solves this problem, it limits the number of sub-models that can be generated. In addition, neither solution is fault tolerant, which is an issue when deployed on edge devices. We propose variant parallelism (VP), an ensemble-based deep learning distribution method in which different variants of a main model are generated and can be deployed on separate machines. We design a family of lighter models around the original model and train them simultaneously to improve accuracy over single models. Our experimental results on six common mid-sized object recognition datasets demonstrate that our models can have 5.8-7.1x fewer parameters, 4.3-31x fewer multiply-accumulations (MACs), and 2.5-13.2x lower response time on atomic inputs compared to MobileNetV2 while achieving comparable or higher accuracy. Our technique easily generates several variants of the base architecture. Each variant returns only 2k outputs, where 1 <= k <= (#classes/2), representing the Top-k classes, instead of the large volume of floating-point values required in MP. Since each variant provides a full-class prediction, our approach maintains higher availability than MP and CP in the presence of failures.
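To make the communication claim concrete, the following Python sketch illustrates the kind of payload each variant could transmit: its softmax output reduced to 2k values (Top-k class indices and scores), which a coordinator then aggregates into a full-class ensemble decision. This is a minimal illustration under assumed names (topk_payload, aggregate) and NumPy only; it is not the paper's implementation.

```python
import numpy as np

def topk_payload(probs: np.ndarray, k: int):
    """Reduce a variant's softmax output to 2k values:
    the Top-k class indices and their scores."""
    idx = np.argsort(probs)[::-1][:k]         # Top-k class indices
    return idx, probs[idx]                    # 2k values to transmit

def aggregate(payloads, num_classes: int) -> int:
    """Coordinator side: sum the partial scores received from the
    available variants and return the ensemble's predicted class."""
    votes = np.zeros(num_classes)
    for idx, scores in payloads:              # each payload is (indices, scores)
        votes[idx] += scores
    return int(np.argmax(votes))

# Toy example: three hypothetical variants over 10 classes
rng = np.random.default_rng(0)
payloads = []
for _ in range(3):
    logits = rng.normal(size=10)
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax
    payloads.append(topk_payload(probs, k=3))       # 2k = 6 values per variant
print(aggregate(payloads, num_classes=10))
```

Because every variant contributes a prediction over all classes, the coordinator can still produce a result when some payloads are missing, which is the availability advantage claimed over MP and CP.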