As large language models gain popularity in real-world applications, processing extremely long contexts, often exceeding the model's pre-trained context limit, has emerged as a critical challenge. While existing approaches to efficient long-context processing show promise, recurrent compression-based methods struggle with information preservation, whereas random-access approaches require substantial memory resources. We introduce REFORM, a novel inference framework that efficiently handles long contexts through a two-phase approach. First, it incrementally processes input chunks while maintaining a compressed KV cache, constructs cross-layer context embeddings, and applies an early-exit strategy for improved efficiency. Second, it identifies and gathers essential tokens via similarity matching and selectively recomputes the KV cache. Compared to baselines, REFORM achieves over 52% and 34% performance gains on RULER and BABILong, respectively, at 1M context length. It also outperforms baselines on Infinite-Bench, RepoEval, and MM-NIAH, demonstrating flexibility across diverse tasks and domains. Additionally, REFORM reduces inference time by 30% and peak memory usage by 5%, achieving both efficiency and superior performance.
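The second phase's similarity matching can be pictured as scoring each context position against a query embedding and keeping only the top-scoring tokens for KV-cache recomputation. The sketch below is a minimal illustration of that idea, not REFORM's actual implementation; the function name, embedding shapes, and the use of plain cosine similarity over per-token embeddings are all assumptions for exposition.

```python
import numpy as np

def select_essential_tokens(context_emb, query_emb, top_k):
    """Hypothetical sketch of similarity-based token selection:
    score each context position by cosine similarity to a query
    embedding and keep the top_k positions (in original order)
    for selective KV-cache recomputation. Names and shapes are
    illustrative, not the paper's API."""
    # Normalize rows so dot products equal cosine similarities.
    ctx = context_emb / np.linalg.norm(context_emb, axis=-1, keepdims=True)
    qry = query_emb / np.linalg.norm(query_emb)
    scores = ctx @ qry                     # shape: (num_tokens,)
    # Indices of the top_k highest-scoring tokens, restored to
    # document order so positional structure is preserved.
    top = np.argsort(scores)[-top_k:]
    return np.sort(top)

# Toy usage: 6 context tokens with 4-dim embeddings; the query is a
# slightly perturbed copy of token 3, so token 3 should be selected.
rng = np.random.default_rng(0)
ctx = rng.normal(size=(6, 4))
qry = ctx[3] + 0.01 * rng.normal(size=4)
selected = select_essential_tokens(ctx, qry, top_k=2)
print(selected)
```

Keeping the selected indices sorted matters in practice: the gathered tokens are re-fed through the model to rebuild their KV entries, so they must retain their original relative positions.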