DNNs are ubiquitous on edge devices nowadays. With their growing importance and breadth of use cases, it is unrealistic to keep every DNN resident in device memory and expect each inference to run warmed up. Consequently, cold inference, the process of reading, initializing, and executing a DNN model, is becoming commonplace, and optimizing its performance is urgently needed. To this end, we present NNV12, the first on-device inference engine that optimizes for cold inference. NNV12 is built atop three novel optimization knobs: selecting a proper kernel (implementation) for each DNN operator, bypassing the weight-transformation step by caching post-transformed weights on disk, and pipelining the execution of many kernels on asymmetric processors. To tackle the huge search space, NNV12 employs a heuristic-based scheme to obtain a near-optimal kernel scheduling plan. We fully implement a prototype of NNV12 and evaluate its performance through extensive experiments. The results show that NNV12 achieves up to 15.2x and 401.5x speedups compared to state-of-the-art DNN engines on edge CPUs and GPUs, respectively.
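To make the weight-caching knob concrete, the sketch below illustrates the general idea of persisting post-transformed weights to disk so that later cold starts can skip the transformation step. This is a minimal illustration, not NNV12's actual implementation: the cache path, the hashing scheme, and the placeholder layout transform are all assumptions introduced here for clarity.

```python
# Minimal sketch (hypothetical, not NNV12's code) of caching post-transformed
# weights on disk so subsequent cold starts bypass the transformation step.
import os
import hashlib
import numpy as np

CACHE_DIR = "/tmp/weight_cache"  # hypothetical cache location

def transform_weights(weights: np.ndarray) -> np.ndarray:
    """Placeholder for an engine-specific transform (e.g., a layout repack)."""
    # Example only: repack NCHW filters into NHWC layout.
    return np.ascontiguousarray(weights.transpose(0, 2, 3, 1))

def load_transformed(op_name: str, weights: np.ndarray) -> np.ndarray:
    """Return post-transformed weights, reusing an on-disk copy when available."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha256(op_name.encode() + weights.tobytes()).hexdigest()
    path = os.path.join(CACHE_DIR, f"{key}.npy")
    if os.path.exists(path):
        # Cache hit: skip the transform entirely on this cold start.
        return np.load(path, mmap_mode="r")
    transformed = transform_weights(weights)
    np.save(path, transformed)  # pay the transform cost only once
    return transformed
```

In a real engine the transform would be kernel-specific (e.g., Winograd or packed GEMM layouts), which is why the caching decision interacts with the kernel-selection knob described above.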