Effective training of today's large language models (LLMs) depends on large batches and long sequences for throughput and accuracy. To handle variable-length sequences on hardware accelerators, it is common practice to introduce padding tokens, so that all sequences in a batch have the same length. We show in this paper that the variation in sequence lengths in common NLP datasets is such that up to 50% of all tokens can be padding. In less common, but not extreme, cases (e.g. GLUE-cola with sequence length 128), the ratio is up to 89%. Existing methods to address the resulting inefficiency are complicated by the need to avoid cross-contamination in self-attention, by a reduction in accuracy when sequence ordering information is lost, or by customized kernel implementations only valid for specific accelerators. This paper introduces a new formalization of sequence packing in the context of the well-studied bin packing problem, and presents new algorithms based on this formulation which, for example, confer a 2x speedup for phase 2 pre-training in BERT. We show how existing models can be adapted to ensure mathematical equivalence between the original and packed models, meaning that packed models can be trained with existing pre-training and fine-tuning practices.
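To make the padding overhead and the bin-packing framing concrete, the sketch below uses a synthetic length distribution and a simple first-fit-decreasing heuristic. It is not the paper's proposed algorithms, just an illustration of how packing variable-length sequences into fixed-size "packs" reduces the fraction of padding tokens; `MAX_LEN` and the length distribution are arbitrary assumptions for the example.

```python
# Minimal sketch: padding waste under naive fixed-length batching vs. a simple
# greedy (first-fit-decreasing) packing. Not the paper's packing algorithms.

import random

MAX_LEN = 128  # target sequence length (assumed for illustration)


def padding_fraction(lengths, max_len):
    """Fraction of tokens that are padding when every sequence is padded to max_len."""
    real = sum(lengths)
    total = len(lengths) * max_len
    return 1.0 - real / total


def greedy_pack(lengths, max_len):
    """First-fit-decreasing: place each sequence into the first pack with room,
    opening a new pack when none fits. Each pack holds lengths summing to <= max_len."""
    packs = []
    for length in sorted(lengths, reverse=True):
        for pack in packs:
            if sum(pack) + length <= max_len:
                pack.append(length)
                break
        else:
            packs.append([length])
    return packs


if __name__ == "__main__":
    random.seed(0)
    # Skewed synthetic distribution: many short sequences, a few long ones.
    lengths = [min(MAX_LEN, int(random.expovariate(1 / 30)) + 5) for _ in range(10_000)]

    print(f"padding with naive batching : {padding_fraction(lengths, MAX_LEN):.1%}")

    packs = greedy_pack(lengths, MAX_LEN)
    packed_real = sum(sum(p) for p in packs)
    packed_total = len(packs) * MAX_LEN
    print(f"padding after greedy packing: {1 - packed_real / packed_total:.1%}")
    print(f"packs needed: {len(packs)} vs {len(lengths)} padded sequences")
```

In a real training setup, each pack would additionally need a block-diagonal attention mask and per-sequence position IDs so that sequences placed in the same pack do not attend to one another (the cross-contamination the paper refers to); that adaptation is what preserves mathematical equivalence with the unpacked model.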