The success of ChatGPT has recently attracted numerous efforts to replicate it, with instruction-tuning strategies being a key factor in achieving remarkable results. Instruction tuning not only significantly enhances the model's performance and generalization but also makes the model's outputs more consistent with human speech patterns. However, current research rarely studies the impact of different amounts of instruction data on model performance, especially in real-world use cases. In this paper, we explore the performance of instruction-tuned large language models across different scales of instruction data. We construct an evaluation dataset consisting of 12 major online use cases. With Bloomz-7B1-mt as the base model, the results show that 1) merely increasing the amount of instruction data leads to continuous improvement in tasks such as open-ended generation, and 2) in tasks such as math and code, the model's performance curve remains quite flat as data size increases. We further analyze the possible causes of these phenomena and propose potential future research directions, such as effectively selecting high-quality training data, scaling up base models, and designing training methods specialized for hard tasks. We will release our training and evaluation datasets, as well as model checkpoints.