Prior work on language model pre-training has explored different architectures and learning objectives, but differences in data, hyperparameters and evaluation make a principled comparison difficult. In this work, we focus on bidirectionality as a key factor that differentiates existing approaches, and present a comprehensive study of its role in next token prediction, text infilling, zero-shot priming and fine-tuning. We propose a new framework that generalizes prior approaches, including fully unidirectional models like GPT, fully bidirectional models like BERT, and hybrid models like CM3 and prefix LM. Our framework distinguishes between two notions of bidirectionality (bidirectional context and bidirectional attention) and allows us to control each of them separately. We find that the optimal configuration is largely application-dependent (e.g., bidirectional attention is beneficial for fine-tuning and infilling, but harmful for next token prediction and zero-shot priming). We train models with up to 6.7B parameters, and find differences to remain consistent at scale. While prior work on scaling has focused on left-to-right autoregressive models, our results suggest that this approach comes with some trade-offs, and it might be worthwhile to develop very large bidirectional models.
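To make the attention-side notion concrete, the sketch below shows one common way to control bidirectional attention at the mask level, in the spirit of a prefix LM. This is an illustrative assumption, not the paper's actual framework or code; the function name `build_attention_mask` and the `n_prefix` parameter are hypothetical.

```python
# Illustrative sketch only: a toy mask mixing bidirectional and causal
# attention, as in a prefix LM. Names and parameters are hypothetical.
import torch

def build_attention_mask(seq_len: int, n_prefix: int) -> torch.Tensor:
    """Return a boolean (seq_len, seq_len) mask where True means that
    query position i may attend to key position j.

    Prefix positions (< n_prefix) attend bidirectionally within the prefix;
    the remaining positions attend causally to themselves and earlier tokens.
    """
    # Causal (lower-triangular) mask: position i sees positions <= i.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Make the prefix block dense so prefix tokens also see later prefix tokens.
    mask[:n_prefix, :n_prefix] = True
    return mask

if __name__ == "__main__":
    # With a 3-token prefix in a 6-token sequence, the top-left 3x3 block is
    # fully attended (bidirectional), while the rest remains causal.
    print(build_attention_mask(6, 3).int())
```

Setting `n_prefix = 0` recovers a fully causal (GPT-style) mask, and `n_prefix = seq_len` a fully bidirectional (BERT-style) one, which is why a single mask parameter suffices to interpolate between the two extremes in this toy setting.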