Pre-trained language models (LMs), such as BERT (Devlin et al., 2018) and its variants, have led to significant improvements on various NLP tasks in recent years. However, a theoretical framework for studying their relationships is still missing. In this paper, we fill this gap by investigating the linear dependency between pre-trained LMs. The linear dependency of LMs is defined analogously to the linear dependency of vectors. We propose Language Model Decomposition (LMD), which represents an LM as a linear combination of other LMs serving as a basis, and we derive the closed-form solution. A goodness-of-fit metric for LMD, analogous to the coefficient of determination, is defined and used to measure the linear dependency of a set of LMs. In experiments, we find that BERT and eleven BERT-like LMs are 91% linearly dependent. This observation suggests that current state-of-the-art (SOTA) LMs are highly "correlated". To further advance SOTA, we need more diverse and novel LMs that are less dependent on existing LMs.
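The abstract does not spell out the exact formulation, but the core idea can be illustrated with a minimal sketch: fit the target LM's representations of a corpus as a linear function of the basis LMs' representations via ordinary least squares (which has a closed-form solution), then report an R²-style goodness-of-fit score. The function and variable names below (`lmd_fit`, `Y`, `Xs`) are illustrative placeholders, not the paper's actual implementation.

```python
import numpy as np

def lmd_fit(Y, Xs):
    """Sketch of Language Model Decomposition via ordinary least squares.

    Y  : (n, d) array of target-LM representations for n tokens/sentences.
    Xs : list of (n, d_i) arrays of basis-LM representations on the same inputs.

    Returns the fitted coefficients and an R^2-like goodness-of-fit score.
    """
    X = np.concatenate(Xs, axis=1)               # stack basis representations column-wise
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)    # closed-form least-squares solution
    residual = Y - X @ W
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((Y - Y.mean(axis=0)) ** 2)
    r2 = 1.0 - ss_res / ss_tot                   # coefficient-of-determination analogue
    return W, r2
```

Under this reading, a score such as the reported 91% would mean that most of the variance in the target LM's representations is explained by a linear combination of the basis LMs' representations.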