Natural Language Processing (NLP) inference is seeing increasing adoption by mobile applications, where on-device inference is desirable for preserving user data privacy and avoiding network round trips. Yet, the unprecedented size of an NLP model stresses both latency and memory, the two key resources of a mobile device, creating a tension between them. To meet a target latency, holding the whole model in memory launches execution as soon as possible but inflates one app's memory footprint by several times, limiting its benefits to only a few inferences before the app is recycled by mobile memory management. On the other hand, loading the model from storage on demand incurs IO delays as long as a few seconds, far exceeding the delay range acceptable to users; pipelining layerwise model loading and execution does not hide the IO either, due to the high skew between IO and computation delays. To this end, we propose Speedy Transformer Inference (STI). Built on the key idea of maximizing IO/compute resource utilization on the most important parts of a model, STI reconciles the latency vs. memory tension via two novel techniques. First, model sharding: STI manages model parameters as independently tunable shards and profiles their importance to accuracy. Second, elastic pipeline planning with a preload buffer: STI instantiates an IO/compute pipeline and uses a small buffer of preloaded shards to bootstrap execution without stalling at early stages; it judiciously selects, tunes, and assembles shards per their importance for resource-elastic execution, maximizing inference accuracy. Atop two commodity SoCs, we build STI and evaluate it on a wide range of NLP tasks, under a practical range of target latencies, and on both CPU and GPU. We demonstrate that STI delivers high accuracy with 1-2 orders of magnitude lower memory, outperforming competitive baselines.
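To make the pipelining idea concrete, below is a minimal, self-contained Python sketch of layerwise shard loading overlapped with execution, bootstrapped by a small preload buffer. All names and constants (load_shard, run_layer, NUM_LAYERS, IO_DELAY, etc.) are hypothetical illustrations under simplified assumptions, not the STI implementation.

```python
import queue
import threading
import time

# Hypothetical toy setup: a deep model whose per-layer shards are streamed
# from storage by an IO thread while a compute thread executes layers.
NUM_LAYERS = 12          # assumed model depth
PRELOAD_LAYERS = 2       # size of the preload buffer, in layers
IO_DELAY = 0.05          # simulated per-shard load time (s)
COMPUTE_DELAY = 0.01     # simulated per-layer compute time (s), << IO_DELAY

def load_shard(layer_id):
    """Simulate reading one layer's shard from storage."""
    time.sleep(IO_DELAY)
    return f"weights[{layer_id}]"

def run_layer(layer_id, shard, activations):
    """Simulate executing one layer with the loaded shard."""
    time.sleep(COMPUTE_DELAY)
    return activations + 1

def io_worker(shard_queue):
    # Shards beyond the preload buffer are fetched on demand, in layer order.
    for layer_id in range(PRELOAD_LAYERS, NUM_LAYERS):
        shard_queue.put((layer_id, load_shard(layer_id)))

def pipelined_inference():
    # Preload buffer: a few early shards held in memory so execution can
    # start immediately without stalling on IO.
    preload = {i: load_shard(i) for i in range(PRELOAD_LAYERS)}

    shard_queue = queue.Queue()
    threading.Thread(target=io_worker, args=(shard_queue,), daemon=True).start()

    activations = 0
    for layer_id in range(NUM_LAYERS):
        if layer_id in preload:
            shard = preload.pop(layer_id)   # served from the preload buffer
        else:
            _, shard = shard_queue.get()    # IO overlapped with prior compute
        activations = run_layer(layer_id, shard, activations)
    return activations

if __name__ == "__main__":
    print("output:", pipelined_inference())
```

The sketch also illustrates why naive pipelining alone cannot hide IO: with per-layer IO far slower than compute, later layers still stall on the queue, which is what motivates STI's additional step of tuning each shard (and hence its IO cost) according to its importance so the pipeline fits the target latency.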