Language model pre-training and derived methods are incredibly impactful in machine learning. However, considerable uncertainty remains about exactly why pre-training helps improve performance on fine-tuning tasks. This is especially true when attempting to adapt language-model pre-training to domains outside of natural language. Here, we analyze this problem by exploring how existing pre-training methods impose relational structure in their induced per-sample latent spaces -- i.e., what constraints pre-training methods impose on the distance or geometry between the pre-trained embeddings of two samples $\vec x_i$ and $\vec x_j$. Through a comprehensive review of existing pre-training methods, we find that this question remains open, despite theoretical analyses demonstrating the importance of understanding this form of induced structure. Based on this review, we introduce a descriptive framework for pre-training that allows for a granular, comprehensive understanding of how relational structure can be induced. We present a theoretical analysis of this framework from first principles and establish a connection between the relational inductive bias of pre-training and fine-tuning performance. We also show how the framework can be used to define new pre-training methods. We build upon these findings with empirical studies on benchmarks spanning three data modalities and ten fine-tuning tasks. These experiments validate our theoretical analyses, inform the design of novel pre-training methods, and establish consistent improvements over a compelling suite of baseline methods.
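As an illustrative sketch (not the specific framework proposed here), one way a pre-training objective can impose relational structure is to penalize the discrepancy between the latent distance of a pair of samples and a target relation supplied by the pre-training method:
$$
\mathcal{L}_{\text{rel}}(\theta) \;=\; \mathbb{E}_{(\vec x_i, \vec x_j)}\Big[\, \ell\big(\, d\big(f_\theta(\vec x_i),\, f_\theta(\vec x_j)\big),\; r(\vec x_i, \vec x_j) \,\big) \Big],
$$
where $f_\theta$ denotes the pre-trained encoder, $d$ a distance on the latent space, $r$ a (hypothetical) target relation between samples, and $\ell$ a penalty such as the squared error. Different choices of $d$, $r$, and $\ell$ correspond to different relational inductive biases; contrastive objectives, for example, can be viewed as a special case in which $r$ indicates whether $\vec x_i$ and $\vec x_j$ form a positive pair.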