Large Language Models (LLMs) have achieved remarkable success across various domains, yet deploying them on mobile devices remains an arduous challenge due to their extensive computational and memory demands. While lightweight LLMs have been developed to fit mobile environments, they suffer from degraded model accuracy. In contrast, sparsity-based techniques minimize DRAM usage by selectively transferring only relevant neurons to DRAM while retaining the full model in external storage, such as flash. However, such approaches are critically limited by numerous I/O operations, particularly on smartphones with severe constraints on I/O operations per second (IOPS). In this paper, we propose Neuralink, a novel approach that accelerates LLM inference on smartphones by optimizing neuron placement in flash memory. Neuralink leverages the concept of Neuron Co-Activation, where neurons frequently activated together are linked to facilitate continuous read access and improve I/O efficiency. Our approach incorporates a two-stage solution: an offline stage that reorganizes neuron placement based on co-activation patterns, and an online stage that employs tailored data access and caching strategies to align well with hardware characteristics. Evaluations conducted on a variety of smartphones and LLMs demonstrate that Neuralink achieves an average $1.49\times$ improvement in end-to-end latency compared to the state-of-the-art. As the first solution to optimize storage placement under sparsity, Neuralink explores a new optimization space at the intersection of sparsity-driven algorithms and storage-level system co-design for LLM inference.
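
To make the offline stage more concrete, the sketch below is a minimal illustration, not the paper's actual algorithm, of how co-activation statistics could drive a linear placement of FFN neurons in flash so that frequently co-activated neurons end up adjacent and can be fetched with fewer, larger reads. The function names (`coactivation_counts`, `greedy_placement`), the pairwise-count representation, and the greedy chaining heuristic are all assumptions introduced here for exposition.

```python
import numpy as np


def coactivation_counts(activation_traces, num_neurons):
    """Count how often each pair of neurons is active in the same inference step.

    activation_traces: iterable of boolean vectors of length num_neurons,
    one per token/step, marking which neurons were activated.
    """
    counts = np.zeros((num_neurons, num_neurons), dtype=np.int64)
    for active in activation_traces:
        idx = np.flatnonzero(active)
        counts[np.ix_(idx, idx)] += 1       # increment all co-active pairs
    np.fill_diagonal(counts, 0)             # ignore self-pairs
    return counts


def greedy_placement(counts):
    """Order neurons so that frequently co-activated pairs sit next to each other,
    approximating contiguous flash reads for co-activated groups (illustrative
    heuristic only; the real placement problem is more involved)."""
    n = counts.shape[0]
    remaining = set(range(n))
    cur = int(counts.sum(axis=1).argmax())  # start from the most co-activated neuron
    order = [cur]
    remaining.remove(cur)
    while remaining:
        # append the unplaced neuron most often co-activated with the last one
        nxt = max(remaining, key=lambda j: counts[cur, j])
        order.append(nxt)
        remaining.remove(nxt)
        cur = nxt
    return order


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # toy traces: 512 steps over 64 neurons, roughly 20% active per step
    traces = rng.random((512, 64)) < 0.2
    order = greedy_placement(coactivation_counts(traces, 64))
    print(order[:10])
```

Under this (assumed) formulation, the output ordering would determine the byte layout of neuron weights in flash, so that a co-activated group maps to a contiguous region and can be served by sequential reads rather than many small random ones.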