Recent investigations into the effective context lengths of modern flagship large language models (LLMs) have revealed major limitations in question answering (QA) and reasoning over long, complex contexts, even for the largest and most capable models. While approaches such as retrieval-augmented generation (RAG) and chunk-based re-ranking attempt to mitigate this issue, they are sensitive to the choice of chunking, embedding, and retrieval strategies and models, and they rely on extensive pre-processing, knowledge-acquisition, and indexing steps. In this paper, we propose Tagging-Augmented Generation (TAG), a lightweight data augmentation strategy that boosts LLM performance in long-context scenarios without degrading or altering the integrity and composition of retrieved documents. We validate our hypothesis by augmenting two challenging and directly relevant question-answering benchmarks -- NoLima and NovelQA -- and show that tagging the context, or even just adding tag definitions to QA prompts, leads to consistent performance gains over the baseline: up to 17% for 32K-token contexts, and 2.9% on complex multi-hop reasoning queries that require knowledge spanning wide stretches of text. Additional details are available at https://sites.google.com/view/tag-emnlp.
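To make the idea concrete, the following is a minimal sketch of context tagging in the spirit of TAG: entity mentions in the context are wrapped in lightweight XML-style tags, and the tag definitions are prepended to the QA prompt. The tag names, entity lexicon, and prompt template here are illustrative assumptions, not the paper's exact scheme.

```python
import re

def tag_context(context: str, entities: dict) -> str:
    """Wrap each known entity mention in a <TYPE>...</TYPE> tag.

    `entities` maps a surface mention to a (hypothetical) tag name,
    e.g. {"Lisbon": "CITY"}.
    """
    for mention, tag in entities.items():
        # Word-boundary match so "Lisbon" does not tag inside "Lisbonite".
        pattern = re.compile(rf"\b{re.escape(mention)}\b")
        context = pattern.sub(f"<{tag}>{mention}</{tag}>", context)
    return context

def build_prompt(context: str, question: str, tag_defs: dict) -> str:
    """Assemble a QA prompt that includes the tag definitions up front."""
    defs = "\n".join(f"<{tag}>: {desc}" for tag, desc in tag_defs.items())
    return (f"Tag definitions:\n{defs}\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}")

context = "Yuki moved to Lisbon after leaving the lab."
tagged = tag_context(context, {"Yuki": "PERSON", "Lisbon": "CITY"})
prompt = build_prompt(
    tagged,
    "Where does Yuki live now?",
    {"PERSON": "a named person", "CITY": "a named city"},
)
print(tagged)
# <PERSON>Yuki</PERSON> moved to <CITY>Lisbon</CITY> after leaving the lab.
```

Because the tags are inserted in place, the surrounding document text is left byte-for-byte intact apart from the tag markers, which is the sense in which the augmentation does not alter the composition of retrieved documents.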