Recent advances in deep learning models come at the price of formidable training cost. The increasing model size is one of the root causes, but another less-emphasized fact is that data scale is actually increasing at a similar speed as model scale, and the training cost is proportional to both of them. Compared to the rapidly evolving model architecture, how to efficiently use the training data (especially for the expensive foundation model pretraining) is both less explored and difficult to realize due to the lack of a convenient framework that focuses on data efficiency capabilities. To this end, we present DeepSpeed Data Efficiency, a framework that makes better use of data, increases training efficiency, and improves model quality. Specifically, we propose and combine two novel data efficiency techniques: efficient data sampling via a general curriculum learning library, and efficient data routing via a novel random layerwise token dropping technique. DeepSpeed Data Efficiency also takes extensibility, flexibility, and composability into consideration, so that users can easily utilize the framework to compose multiple techniques and apply customized strategies. By applying our solution to GPT-3 1.3B and BERT-large language model pretraining, we can achieve similar model quality with up to 2x less data and 2x less time, or achieve better model quality under the same amount of data and time.
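To make the data routing idea more concrete, the following is a minimal, self-contained sketch of random layerwise token dropping, not the DeepSpeed Data Efficiency API: a given transformer layer processes only a randomly selected subset of token positions, while the remaining tokens bypass that layer unchanged. The function name, the fixed keep_ratio, and the per-sample selection are assumptions for illustration; in practice the fraction of kept tokens would typically be scheduled over the course of training.

```python
import torch
import torch.nn as nn

def random_layer_token_drop(hidden_states: torch.Tensor,
                            layer: nn.Module,
                            keep_ratio: float = 0.5) -> torch.Tensor:
    """Illustrative sketch: the given layer processes only a random subset
    of token positions; dropped tokens bypass the layer unchanged."""
    batch, seq_len, hidden = hidden_states.shape
    num_keep = max(1, int(seq_len * keep_ratio))

    # Pick a random subset of token positions for each sample in the batch.
    keep_idx = torch.stack([
        torch.randperm(seq_len, device=hidden_states.device)[:num_keep]
        for _ in range(batch)
    ])                                                    # (batch, num_keep)
    gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, hidden)

    kept = torch.gather(hidden_states, 1, gather_idx)     # shorter sequence fed to the layer
    processed = layer(kept)                               # layer cost scales with num_keep

    # Scatter the processed tokens back; bypassed tokens keep their input values.
    output = hidden_states.clone()
    output.scatter_(1, gather_idx, processed)
    return output

# Example usage with a standard encoder layer that maps (batch, seq, hidden) to the same shape:
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
x = torch.randn(2, 16, 64)                                # (batch, seq_len, hidden)
y = random_layer_token_drop(x, layer, keep_ratio=0.5)     # same shape as x
```

Because the layer only attends over the kept tokens, its compute roughly scales with keep_ratio, which is the source of the training-cost savings described above.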