Understanding the geometric properties of the latent spaces of natural language processing models makes it possible to manipulate these properties to improve performance on downstream tasks. One such property is the amount of data spread in a model's latent space, that is, how fully the available latent space is being used. In this work, we define data spread and demonstrate that two commonly used measures of it, Average Cosine Similarity and a partition function min/max ratio I(V), are not reliable metrics for comparing the use of latent space across models. We propose and examine eight alternative measures of data spread, all but one of which improve over these current metrics when applied to seven synthetic data distributions. Of our proposed measures, we recommend one principal component-based measure and one entropy-based measure; both provide reliable, relative measures of spread and can be used to compare models of different sizes and dimensionalities.
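
For readers unfamiliar with the two baseline measures named above, the following is a minimal NumPy sketch of how they are commonly computed. It is an illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, Average Cosine Similarity is taken here as the mean over all distinct pairs of points, and I(V) is assumed to follow the usual formulation, the ratio of the minimum to the maximum of Z(c) = Σ_v exp(cᵀv) with c ranging over the eigenvectors of VᵀV.

```python
import numpy as np


def average_cosine_similarity(X):
    """Mean cosine similarity over all distinct pairs of rows of X.

    Assumption: X is an (n_points, dim) array with no all-zero rows.
    Values near 1 suggest points crowded into a narrow cone.
    """
    X = np.asarray(X, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Keep only the strict upper triangle so each pair is counted once
    # and self-similarities are excluded.
    rows, cols = np.triu_indices(X.shape[0], k=1)
    return float(sims[rows, cols].mean())


def partition_ratio(X):
    """Assumed form of I(V): min_c Z(c) / max_c Z(c), Z(c) = sum_v exp(c . v).

    c ranges over the unit eigenvectors of X^T X. Eigenvector signs are
    arbitrary, so the value can depend on the solver's sign convention.
    """
    X = np.asarray(X, dtype=float)
    _, eigvecs = np.linalg.eigh(X.T @ X)  # columns are eigenvectors
    Z = np.exp(X @ eigvecs).sum(axis=0)   # one Z value per eigenvector
    return float(Z.min() / Z.max())
```

Under these assumptions, points that all lie along a single direction give an average cosine similarity of 1, and a point set that is symmetric about the origin along each axis gives a partition ratio of 1. Neither function normalizes for dimensionality or scale, which is relevant when comparing models of different sizes.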