Recent advances in deep learning models come at the price of formidable training cost. The increasing model size is one of the root causes, but another less-emphasized fact is that data scale is actually increasing at a similar speed as model scale, and the training cost is proportional to both of them. Compared to the rapidly evolving model architecture, how to efficiently use the training data (especially for the expensive foundation model pretraining) is both less explored and difficult to realize due to the lack of a convenient framework that focuses on data efficiency capabilities. To this end, we present DeepSpeed Data Efficiency, a framework that makes better use of data, increases training efficiency, and improves model quality. Specifically, we propose and combine two novel data efficiency techniques: efficient data sampling via a general curriculum learning library, and efficient data routing via a novel random layerwise token dropping technique. DeepSpeed Data Efficiency also takes extensibility, flexibility, and composability into consideration, so that users can easily utilize the framework to compose multiple techniques and apply customized strategies. By applying our solution to GPT-3 1.3B and BERT-large language model pretraining, we can achieve similar model quality with up to 2x less data and 2x less time, or achieve better model quality under the same amount of data and time.
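To make the data routing idea more concrete, the following is a minimal, self-contained sketch of random layerwise token dropping, not the DeepSpeed Data Efficiency API: a given transformer layer processes only a randomly selected subset of token positions, while the remaining tokens bypass that layer unchanged. The function name, the fixed keep_ratio, and the per-sample selection are assumptions for illustration; in practice the fraction of kept tokens would typically be scheduled over the course of training.

```python
import torch
import torch.nn as nn

def random_layer_token_drop(hidden_states: torch.Tensor,
                            layer: nn.Module,
                            keep_ratio: float = 0.5) -> torch.Tensor:
    """Illustrative sketch: the given layer processes only a random subset
    of token positions; dropped tokens bypass the layer unchanged."""
    batch, seq_len, hidden = hidden_states.shape
    num_keep = max(1, int(seq_len * keep_ratio))

    # Pick a random subset of token positions for each sample in the batch.
    keep_idx = torch.stack([
        torch.randperm(seq_len, device=hidden_states.device)[:num_keep]
        for _ in range(batch)
    ])                                                    # (batch, num_keep)
    gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, hidden)

    kept = torch.gather(hidden_states, 1, gather_idx)     # shorter sequence fed to the layer
    processed = layer(kept)                               # layer cost scales with num_keep

    # Scatter the processed tokens back; bypassed tokens keep their input values.
    output = hidden_states.clone()
    output.scatter_(1, gather_idx, processed)
    return output

# Example usage with a standard encoder layer that maps (batch, seq, hidden) to the same shape:
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
x = torch.randn(2, 16, 64)                                # (batch, seq_len, hidden)
y = random_layer_token_drop(x, layer, keep_ratio=0.5)     # same shape as x
```

Because the layer only attends over the kept tokens, its compute roughly scales with keep_ratio, which is the source of the training-cost savings described above.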