Instruction tuning enables language models to generalize more effectively and better follow user intent. However, obtaining instruction data can be costly and challenging. Prior works rely on expensive human annotation, crowd-sourced datasets with alignment issues, or noisy examples generated via LLMs. We introduce the LongForm dataset, which is created by leveraging English corpus examples with augmented instructions. We select a diverse set of human-written documents from existing corpora such as C4 and Wikipedia and generate instructions for the given documents via LLMs. This approach yields a cheaper and cleaner instruction-tuning dataset, one that is well suited to long text generation. We finetune T5, OPT, and LLaMA models on our dataset and show that even smaller LongForm models generalize well to text generation tasks. Our models outperform 10x larger language models without instruction tuning on various tasks such as story/recipe generation and long-form question answering. Moreover, LongForm models outperform prior instruction-tuned models such as FLAN-T5 and Alpaca by a large margin. Finally, our models can effectively follow and answer multilingual instructions; we demonstrate this for news generation. We publicly release our data and models: https://github.com/akoksal/LongForm.
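The core data-collection step described above, prompting an LLM to write the instruction that a given corpus document could answer, can be sketched as follows. The prompt template and function name here are illustrative assumptions for a zero-shot setup, not the exact prompt used to build LongForm:

```python
def build_reverse_instruction_prompt(document: str) -> str:
    """Format a prompt asking an LLM to generate an instruction
    for an existing human-written document.

    Hypothetical template: the real LongForm prompt may be worded
    differently and may include few-shot examples.
    """
    return (
        "Instruction: X\n"
        f"Output: {document}\n\n"
        "What kind of instruction could this be the answer to?\n"
        "X:"
    )


# The resulting string is sent to an LLM; its completion becomes the
# instruction paired with the original document as the target output.
prompt = build_reverse_instruction_prompt(
    "Preheat the oven to 180C. Mix flour, sugar, and eggs..."
)
```

Each (generated instruction, original document) pair then serves as one instruction-tuning example, so the long-form target text is human-written rather than LLM-generated.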