Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that exploits the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis shows that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND consistently outperforms state-of-the-art speculative decoding methods across diverse inference patterns, including single-trajectory decoding, batch decoding, and test-time tree search. As a model-free approach, STAND can be applied to any existing language model without additional training, making it a powerful plug-and-play solution for accelerating language model reasoning.
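As a point of reference for the "optimized Gumbel-Top-K sampling" mentioned above, the sketch below illustrates the generic Gumbel-Top-K trick for drawing several distinct candidates from a vector of logits; it is a minimal illustration of the standard technique, not the paper's implementation, and the function name and example values are hypothetical.

```python
# Generic Gumbel-Top-K sampling sketch: perturb logits with i.i.d. Gumbel(0, 1)
# noise and take the top-k indices, which yields k distinct samples drawn
# without replacement in proportion to softmax(logits).
import numpy as np

def gumbel_top_k(logits: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Sample k distinct indices without replacement, weighted by softmax(logits)."""
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    gumbel_noise = -np.log(-np.log(rng.uniform(size=logits.shape)))
    perturbed = logits + gumbel_noise
    # The k largest perturbed logits form a without-replacement sample.
    return np.argsort(perturbed)[::-1][:k]

# Example: pick 3 candidate draft tokens from a hypothetical 8-token logit vector.
rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, 1.0, -1.0, 0.0, 3.0, -0.5, 1.5])
print(gumbel_top_k(logits, k=3, rng=rng))
```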