We study retrieval design for code-focused generation tasks under realistic compute budgets. Using two complementary tasks from Long Code Arena, code completion and bug localization, we systematically compare retrieval configurations across context window sizes along three axes: (i) chunking strategy, (ii) similarity scoring, and (iii) splitting granularity. Our findings: (1) For PL-PL retrieval, sparse BM25 with word-level splitting is the most effective and practical choice, significantly outperforming dense alternatives while running an order of magnitude faster. (2) For NL-PL retrieval, proprietary dense encoders (the Voyager-3 family) consistently beat sparse retrievers, albeit at roughly 100x higher latency. (3) Optimal chunk size scales with the available context: chunks of 32-64 lines work best at small budgets, while whole-file retrieval becomes competitive at 16,000 tokens. (4) Simple line-based chunking matches syntax-aware splitting across budgets. (5) Retrieval latency varies by up to 200x across configurations; BPE-based splitting is needlessly slow, and BM25 with word-level splitting offers the best quality-latency trade-off. We distill these results into evidence-based recommendations for building effective code-oriented RAG systems given task requirements, model constraints, and computational efficiency.
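The recommended PL-PL configuration above (line-based chunking, word-level splitting, BM25 scoring) can be sketched in a few dozen lines of plain Python. This is a minimal illustration, not the paper's implementation: the tokenizer regex, chunk size, and BM25 parameters (`k1`, `b`) are assumptions chosen for clarity.

```python
import math
import re
from collections import Counter


def word_split(text):
    # Word-level splitting: lowercase alphanumeric runs, so identifiers like
    # parse_config break into ["parse", "config"]. The exact tokenizer used
    # in the paper is an assumption here.
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_lines(source, chunk_size=32):
    # Simple line-based chunking into fixed-size windows (32 lines, matching
    # the small-budget regime discussed above).
    lines = source.splitlines()
    return ["\n".join(lines[i:i + chunk_size])
            for i in range(0, len(lines), chunk_size)]


class BM25:
    """Textbook Okapi BM25 over pre-tokenized chunks."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.doc_len = [len(d) for d in docs]
        self.avgdl = sum(self.doc_len) / max(len(docs), 1)
        self.tfs = [Counter(d) for d in docs]
        df = Counter()
        for d in docs:
            df.update(set(d))
        n = len(docs)
        # Standard smoothed IDF.
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5))
                    for t, c in df.items()}

    def score(self, query, idx):
        tf, dl = self.tfs[idx], self.doc_len[idx]
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            f = tf[t]
            s += self.idf[t] * f * (self.k1 + 1) / (
                f + self.k1 * (1 - self.b + self.b * dl / self.avgdl))
        return s

    def top_k(self, query, k=5):
        # Rank chunk indices by BM25 score, highest first.
        scores = [(self.score(query, i), -i) for i in range(len(self.docs))]
        return [-i for _, i in sorted(scores, reverse=True)[:k]]
```

A typical use is to index `chunk_lines(repo_text)` once per repository, then retrieve with `bm25.top_k(word_split(query_snippet))`; because scoring is pure term arithmetic over small postings, this path avoids both BPE tokenization and dense encoding, which is where the latency advantage reported above comes from.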