Contextualized representations based on neural language models have advanced the state of the art in various NLP tasks. Despite their great success, the nature of such representations remains a mystery. In this paper, we present an empirical property of these representations: "average" approximates "first principal component". Specifically, experiments show that the average of these representations shares almost the same direction as the first principal component of the matrix whose columns are these representations. We believe this explains why the average representation is always a simple yet strong baseline. Our further examinations show that this property also holds in more challenging scenarios, for example, when the representations come from a model immediately after random initialization. We therefore conjecture that this property is intrinsic to the distribution of the representations and not necessarily related to the input structure. We observe that these representations empirically follow a normal distribution in each dimension, and under this assumption we show that the empirical property can in fact be derived mathematically.
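The claimed property can be reproduced with a small numerical sketch, assuming (as the abstract suggests) that each dimension of the representations is approximately normally distributed with a nonzero mean. The sketch below uses synthetic Gaussian vectors in place of actual model representations and compares the average vector with the first left singular vector of the uncentered data matrix; the dimensions, sample count, and distribution parameters are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 1000  # representation dimension, number of tokens (illustrative)

# Each dimension ~ N(mu_i, sigma_i^2) with nonzero per-dimension means,
# mimicking the normality assumption made in the paper.
mu = rng.normal(0.0, 1.0, size=d)
sigma = rng.uniform(0.1, 0.5, size=d)
X = rng.normal(mu, sigma, size=(n, d)).T  # columns are "representations"

# Average representation vs. first principal direction of the
# uncentered matrix (first left singular vector).
avg = X.mean(axis=1)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = U[:, 0]

# Cosine similarity between the two directions (sign is arbitrary).
cos = abs(avg @ pc1) / np.linalg.norm(avg)
print(f"cosine(avg, pc1) = {cos:.4f}")  # very close to 1
```

When the per-dimension means dominate the per-dimension variances (here ||mu||^2 is much larger than any sigma_i^2), the uncentered second-moment matrix is approximately mu mu^T plus a small diagonal term, so its top eigenvector aligns with the mean direction, which is what the experiment shows.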