Deploying Large Language Models (LLMs) on mobile devices faces a dilemma: smaller models deliver insufficient accuracy, while larger ones consume excessive resources. This paper observes that mobile Neural Processing Units (NPUs) leave computational resources, particularly their matrix multiplication units, underutilized during typical LLM inference. To exploit this idle compute capacity, we propose applying parallel test-time scaling on mobile NPUs to boost the performance of smaller LLMs. However, this approach confronts inherent NPU limitations, including inadequate hardware support for fine-grained quantization and low efficiency for general-purpose computation. To overcome these, we introduce two key techniques: a hardware-aware tile quantization scheme that aligns group quantization with NPU memory access patterns, and efficient LUT-based replacements for complex operations such as Softmax and dequantization. We design and implement an end-to-end inference system that leverages the NPU's compute capability to support test-time scaling on Qualcomm Snapdragon platforms. Experiments show that our approach brings significant speedups: up to 19.0× for mixed-precision GEMM and 2.2× for Softmax. More importantly, we demonstrate that smaller models with test-time scaling can match or exceed the accuracy of larger models, establishing a new performance-cost Pareto frontier.
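To make the two techniques named above concrete, the sketch below illustrates the general ideas in plain C++: per-group (tile-shaped) symmetric quantization where each group shares one scale, and a precomputed lookup table that replaces the transcendental exp() call inside Softmax. This is a minimal illustration only, not the paper's NPU kernels; the group size of 32, the 256-entry table, and all names here are assumptions for exposition.

```cpp
// Illustrative sketch only -- not the paper's implementation.
// Assumed parameters: 32-element quantization groups, 256-entry exp LUT.
#include <cstdint>
#include <cmath>
#include <vector>
#include <algorithm>

// Per-group symmetric int8 quantization: each group of 32 weights shares one
// scale, so dequantization is a single multiply per element.
struct QuantGroup {
    float scale;
    int8_t q[32];
};

QuantGroup quantize_group(const float* w) {
    float max_abs = 0.f;
    for (int i = 0; i < 32; ++i) max_abs = std::max(max_abs, std::fabs(w[i]));
    QuantGroup g;
    g.scale = max_abs > 0.f ? max_abs / 127.f : 1.f;
    for (int i = 0; i < 32; ++i)
        g.q[i] = static_cast<int8_t>(std::lround(w[i] / g.scale));
    return g;
}

// LUT-based exp: precompute exp() over a fixed input range once, then replace
// the transcendental call in Softmax with a clamped table lookup.
struct ExpLUT {
    static constexpr int N = 256;
    // After max-subtraction, Softmax inputs are <= 0; clamp the low end.
    static constexpr float LO = -10.f, HI = 0.f;
    float table[N];
    ExpLUT() {
        for (int i = 0; i < N; ++i)
            table[i] = std::exp(LO + (HI - LO) * i / (N - 1));
    }
    float operator()(float x) const {
        x = std::clamp(x, LO, HI);
        int idx = static_cast<int>((x - LO) / (HI - LO) * (N - 1));
        return table[idx];
    }
};

// Softmax that uses the LUT in place of exp().
void softmax_lut(std::vector<float>& v, const ExpLUT& lut) {
    float m = *std::max_element(v.begin(), v.end());
    float sum = 0.f;
    for (float& x : v) { x = lut(x - m); sum += x; }
    for (float& x : v) x /= sum;
}
```

The design intuition is the same as in the abstract: keeping the quantization granularity aligned with how the matrix units fetch tiles avoids irregular memory access, and turning exp() and dequantization into table lookups keeps those steps off the NPU's slow general-purpose path.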