Generative LLMs have achieved remarkable success in various industrial applications, owing to their promising in-context learning capabilities. However, the long contexts required by complex tasks pose a significant barrier to wider adoption, in two main respects: (i) excessively long contexts incur high costs and inference latency; (ii) the large amount of task-irrelevant information introduced by long contexts exacerbates the "lost in the middle" problem. Existing methods compress context by removing redundant tokens according to metrics such as self-information or perplexity (PPL), which is inconsistent with the objective of retaining the tokens most important when conditioning on a given query. In this study, we introduce information bottleneck (IB) theory to model the problem, offering a novel perspective that comprehensively captures the properties required for context compression. Additionally, we propose a cross-attention-based approach to approximate the mutual information terms in the IB objective, which can be flexibly replaced with suitable alternatives in different scenarios. Extensive experiments on four datasets demonstrate that our method achieves a 25% increase in compression rate over the state of the art while maintaining question-answering performance. Notably, the context compressed by our method even outperforms the full context in some cases.
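For reference, the standard information bottleneck objective that the abstract alludes to can be written as below; the notation (original context $X$, compressed context $\tilde{X}$, task signal $Y$ such as the query/answer, and trade-off coefficient $\beta$) is illustrative and not necessarily the paper's own.

```latex
% Standard IB objective: compress X into \tilde{X} while preserving
% the information in \tilde{X} that is relevant to Y.
\min_{p(\tilde{x}\mid x)} \; I(X;\tilde{X}) \;-\; \beta \, I(\tilde{X};Y)
```

Under this reading, query-aware compression keeps the tokens that contribute most to $I(\tilde{X};Y)$, rather than those scoring highest on query-agnostic metrics such as self-information or PPL.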
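The sketch below shows one way a cross-attention score could drive query-aware token selection, in the spirit of the approach described above; the function name, the embedding inputs, and the fixed keep ratio are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def compress_by_cross_attention(ctx_emb: torch.Tensor,
                                qry_emb: torch.Tensor,
                                keep_ratio: float = 0.5) -> torch.Tensor:
    """Score each context token by the cross-attention mass it receives
    from the query, then keep the highest-scoring tokens.

    ctx_emb: (n_ctx, d) embeddings of context tokens
    qry_emb: (n_qry, d) embeddings of query tokens
    Returns indices of kept context tokens, in original order.
    """
    d = ctx_emb.size(-1)
    # Cross-attention weights: query tokens attend over context tokens.
    attn = F.softmax(qry_emb @ ctx_emb.T / d ** 0.5, dim=-1)  # (n_qry, n_ctx)
    # Importance of a context token = total attention it receives.
    scores = attn.sum(dim=0)                                   # (n_ctx,)
    k = max(1, int(keep_ratio * ctx_emb.size(0)))
    return scores.topk(k).indices.sort().values
```

The scoring function here is deliberately separable from the selection step: it could be swapped for another estimator of query-conditioned relevance, which mirrors the flexibility the abstract claims for the mutual-information approximation.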