The success of ChatGPT has recently attracted numerous efforts to replicate it, with instruction-tuning strategies being a key factor in achieving remarkable results. Instruction tuning not only significantly enhances the model's performance and generalization but also makes the model's outputs more consistent with human speech patterns. However, current research rarely studies the impact of different amounts of instruction data on model performance, especially in real-world use cases. In this paper, we explore the performance of instruction-tuned large language models across different scales of instruction data. We construct an evaluation dataset consisting of 12 major online use cases. With Bloomz-7B1-mt as the base model, the results show that 1) merely increasing the amount of instruction data leads to continuous improvement in tasks such as open-ended generation, and 2) in tasks such as math and code, the model's performance curve remains quite flat as data size increases. We further analyze the possible causes of these phenomena and propose potential future research directions, such as effectively selecting high-quality training data, scaling up base models, and designing training methods specialized for hard tasks. We will release our training and evaluation datasets, as well as model checkpoints.