We propose LLMA, an LLM accelerator that losslessly speeds up Large Language Model (LLM) inference with references. LLMA is motivated by the observation that there are abundant identical text spans between the decoding result of an LLM and a reference that is available in many real-world scenarios (e.g., retrieved documents). LLMA first selects a text span from the reference and copies its tokens to the decoder, and then efficiently checks the tokens' appropriateness as the decoding result in parallel within one decoding step. The improved computational parallelism allows LLMA to achieve over 2x speed-up for LLMs, with generation results identical to greedy decoding, in many practical generation scenarios where significant overlap exists between the in-context reference and the outputs (e.g., search engines and multi-turn conversations).
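To make the copy-then-verify mechanism concrete, below is a minimal Python sketch of one such decoding loop, under stated assumptions: `model_logits` is a hypothetical stand-in for a single forward pass that returns next-token logits for every position of the input sequence, and the helper `find_copy_span` and the specific match/copy lengths are illustrative choices, not the authors' implementation.

```python
# A minimal sketch of LLMA-style copy-then-verify decoding (illustrative,
# not the authors' code). Tokens are plain integers; `model_logits(seq)`
# is assumed to return an array where logits[t] predicts token t+1.

def find_copy_span(output, reference, match_len=1, copy_len=7):
    """Return up to `copy_len` reference tokens that follow a match of the
    last `match_len` output tokens in the reference, or [] if no match."""
    if len(output) < match_len:
        return []
    suffix = output[-match_len:]
    for i in range(len(reference) - match_len):
        if reference[i:i + match_len] == suffix:
            start = i + match_len
            return reference[start:start + copy_len]
    return []

def llma_decode(model_logits, prompt, reference, max_new_tokens=128):
    output = []
    while len(output) < max_new_tokens:
        draft = find_copy_span(output, reference)
        # Score all draft tokens in parallel with a single forward pass.
        seq = prompt + output + draft
        logits = model_logits(seq)
        base = len(prompt) + len(output)  # index of the first draft token
        # Accept the longest draft prefix that agrees with greedy decoding.
        accepted = 0
        while accepted < len(draft):
            pred = int(logits[base + accepted - 1].argmax())
            if pred != draft[accepted]:
                break
            accepted += 1
        output.extend(draft[:accepted])
        # Emit one more greedy token, so every step produces >= 1 token.
        # (EOS handling is omitted for brevity.)
        output.append(int(logits[base + accepted - 1].argmax()))
    return output[:max_new_tokens]
```

Because every emitted token is either a copied token verified to equal the model's greedy prediction or the greedy prediction itself, the output matches plain greedy decoding token for token; the speed-up comes from scoring the copied span in one forward pass instead of one pass per token.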