Recent advances in deep learning models come at the price of formidable training cost. The increasing model size is one root cause, but another, less-emphasized factor is that data scale is growing at a pace similar to model scale, and training cost is proportional to both. Compared to the rapidly evolving model architectures, how to efficiently use the training data (especially for expensive foundation model pretraining) is both less explored and harder to realize, due to the lack of a convenient framework that focuses on data efficiency capabilities. To this end, we present the DeepSpeed Data Efficiency library, a framework that makes better use of data, increases training efficiency, and improves model quality. Specifically, it provides efficient data sampling via curriculum learning and efficient data routing via random layerwise token dropping. DeepSpeed Data Efficiency takes extensibility, flexibility, and composability into consideration, so that users can easily compose multiple techniques and apply customized strategies. By applying our solution to GPT-3 1.3B and BERT-Large language model pretraining, we achieve similar model quality with up to 2x less data and 2x less time, or better model quality under a similar amount of data and time.
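To make the two techniques named above concrete, the following is a minimal illustrative sketch in PyTorch of (1) curriculum learning via a pacing schedule over a difficulty metric (here, sequence length) and (2) random layerwise token dropping, where a random subset of tokens bypasses a middle layer and is restored afterwards. The function names, the linear pacing rule, and the 50% keep ratio are assumptions made for illustration only; they are not the DeepSpeed Data Efficiency API.

```python
import torch


def curriculum_max_seqlen(step: int, total_steps: int,
                          start_len: int = 64, full_len: int = 2048) -> int:
    """Illustrative linear pacing: grow the allowed sequence length over training."""
    frac = min(1.0, step / max(1, total_steps))
    return int(start_len + frac * (full_len - start_len))


def random_layer_token_drop(hidden: torch.Tensor, keep_ratio: float = 0.5):
    """Keep a random subset of tokens before a middle layer.

    hidden: [batch, seq, dim]. Returns the reduced hidden states plus the
    indices needed to scatter the processed tokens back afterwards.
    """
    bsz, seq, dim = hidden.shape
    num_keep = max(1, int(seq * keep_ratio))
    keep_idx = torch.rand(bsz, seq).argsort(dim=-1)[:, :num_keep]        # random tokens per sample
    kept = hidden.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))  # [bsz, num_keep, dim]
    return kept, keep_idx


if __name__ == "__main__":
    # Data sampling: truncate batches to the current curriculum length.
    step, total_steps = 1000, 10000
    max_len = curriculum_max_seqlen(step, total_steps)
    batch = torch.randn(4, 2048, 512)[:, :max_len, :]

    # Data routing: only the kept tokens pass through a (stand-in) middle layer,
    # then are scattered back into the full hidden states.
    kept, keep_idx = random_layer_token_drop(batch, keep_ratio=0.5)
    processed = torch.tanh(kept)  # stand-in for a transformer layer
    restored = batch.clone()
    restored.scatter_(1, keep_idx.unsqueeze(-1).expand_as(processed), processed)
```

Because both pieces operate on the data pipeline and the hidden states independently, they compose naturally; the library exposes such strategies behind a configurable interface so users can combine or customize them.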