Large Language Models (LLMs) have shown impressive results on a variety of text understanding tasks. Search queries, however, pose a unique challenge, given their short length and lack of nuance or context. Complicated feature engineering efforts do not always lead to downstream improvements, as their performance benefits may be offset by the increased complexity they add to knowledge distillation. Thus, in this paper we make the following contributions: (1) We demonstrate that Retrieval Augmentation of queries provides LLMs with valuable additional context, enabling improved understanding. While Retrieval Augmentation typically increases the latency of LLMs (thus hurting distillation efficacy), (2) we provide a practical and effective way of distilling retrieval-augmented LLMs. Specifically, we use a novel two-stage distillation approach that allows us to carry over the gains of retrieval augmentation without suffering the increased compute typically associated with it. (3) We demonstrate the benefits of the proposed approach (QUILL) on a billion-scale, real-world query understanding system, resulting in substantial gains. Via extensive experiments, including on public benchmarks, we believe this work offers a recipe for the practical use of retrieval-augmented query understanding.
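To make the two-stage recipe concrete, below is a minimal sketch in PyTorch using tiny MLPs as stand-ins for the LLMs. This is an illustration under stated assumptions, not the paper's implementation: the model names, dimensions, temperature, and training loops are all invented for the example, and a classification-style query understanding task is assumed. The structural point it shows is that retrieved-document features are consumed only by the first teacher, so the final served student needs only the query and pays no retrieval latency at inference time.

```python
# Sketch of two-stage distillation for a retrieval-augmented teacher.
# All names and sizes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

QUERY_DIM, DOC_DIM, NUM_LABELS = 32, 32, 4

# Stage-0 teacher: sees the query *and* retrieved-document features.
# (In practice this model would first be fine-tuned on labeled data;
# that step is omitted here for brevity.)
teacher = nn.Sequential(nn.Linear(QUERY_DIM + DOC_DIM, 64), nn.ReLU(),
                        nn.Linear(64, NUM_LABELS))
# Intermediate teacher: comparable capacity, but query-only input.
mid_teacher = nn.Sequential(nn.Linear(QUERY_DIM, 64), nn.ReLU(),
                            nn.Linear(64, NUM_LABELS))
# Final student: small and query-only -- this is what gets served.
student = nn.Sequential(nn.Linear(QUERY_DIM, 16), nn.ReLU(),
                        nn.Linear(16, NUM_LABELS))

def distill_loss(teacher_logits, student_logits, T=2.0):
    """Soft-label KL distillation loss with temperature T."""
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T

queries = torch.randn(256, QUERY_DIM)  # toy query features
docs = torch.randn(256, DOC_DIM)       # toy retrieved-document features

# Stage 1: distill the retrieval-augmented teacher into the query-only
# intermediate teacher, so retrieval is needed only at training time.
opt = torch.optim.Adam(mid_teacher.parameters(), lr=1e-3)
for _ in range(100):
    with torch.no_grad():
        t_logits = teacher(torch.cat([queries, docs], dim=-1))
    loss = distill_loss(t_logits, mid_teacher(queries))
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: distill the intermediate teacher into the small serving student.
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(100):
    with torch.no_grad():
        m_logits = mid_teacher(queries)
    loss = distill_loss(m_logits, student(queries))
    opt.zero_grad(); loss.backward(); opt.step()
```

The design choice the sketch highlights is the intermediate query-only teacher: distilling the retrieval-augmented model directly into a small student would force the student to bridge both the input gap (no retrieved documents) and the capacity gap at once, whereas the two-stage path closes them one at a time.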