Natural Language Processing (NLP) inference is seeing increasing adoption in mobile applications, where on-device inference is desirable because it preserves user data privacy and avoids network round trips. Yet the unprecedented size of an NLP model stresses both latency and memory, the two key resources of a mobile device. To meet a target latency, holding the whole model in memory lets execution start as soon as possible but inflates an app's memory footprint several-fold, limiting the benefit to only a few inferences before the app is recycled by mobile memory management. On the other hand, loading the model from storage on demand incurs several seconds of IO, far exceeding the delay a user finds acceptable; pipelining layerwise model loading and execution does not hide the IO either, because of the large skew between IO and computation delays. To this end, we propose WRX. Built on the key idea of maximizing IO/compute resource utilization on the most important parts of a model, WRX reconciles the latency/memory tension via two novel techniques. First, model sharding: WRX manages model parameters as independently tunable shards and profiles their importance to accuracy. Second, elastic pipeline planning with a preload buffer: WRX instantiates an IO/computation pipeline and uses a small buffer of preloaded shards to bootstrap execution without stalling in early stages; it judiciously selects, tunes, and assembles shards according to their importance for resource-elastic execution, which maximizes inference accuracy. We build WRX atop two commodity SoCs and evaluate it on a wide range of NLP tasks, under a practical range of target latencies, and on both CPU and GPU. We demonstrate that WRX delivers high accuracy with 1--2 orders of magnitude lower memory, outperforming competitive baselines.
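To make the pipeline idea concrete, the following is a minimal sketch (not the authors' implementation) of how a small preload buffer can bootstrap execution while a background loader streams the remaining shards from storage, overlapping IO with computation; names such as `Shard`, `load_from_storage`, and `compute` are hypothetical placeholders.

```python
import threading
import queue
from dataclasses import dataclass

@dataclass
class Shard:
    layer: int          # which model layer this shard belongs to
    importance: float   # profiled contribution to accuracy
    data: bytes | None = None

def load_from_storage(shard: Shard) -> Shard:
    # Placeholder: read the (possibly down-tuned) shard from flash storage.
    shard.data = b"..."
    return shard

def compute(shard: Shard, activations):
    # Placeholder: execute one layer's shard on CPU/GPU.
    return activations

def run_pipeline(shards: list[Shard], preload_k: int, activations):
    # Keep the first k shards resident as a preload buffer so the early
    # layers never stall waiting on IO.
    ready: queue.Queue[Shard] = queue.Queue()
    for s in shards[:preload_k]:
        ready.put(load_from_storage(s))

    def loader():
        # Stream the remaining shards while earlier layers are computing.
        for s in shards[preload_k:]:
            ready.put(load_from_storage(s))

    threading.Thread(target=loader, daemon=True).start()
    for _ in shards:
        activations = compute(ready.get(), activations)
    return activations
```

In this sketch, the per-shard IO cost (and hence how aggressively each shard is compressed or tuned) would be chosen by the pipeline planner according to the profiled importance, so that IO and compute stages stay balanced within the target latency.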