The recent success of distributed word representations has led to increased interest in analyzing the properties of their spatial distribution. Current metrics suggest that contextualized word embedding models do not uniformly utilize all dimensions when embedding tokens in vector space. Here we argue that existing metrics are fragile and tend to obfuscate the true spatial distribution of point clouds. To ameliorate this issue, we propose IsoScore, a novel metric that quantifies the degree to which a point cloud uniformly utilizes the ambient vector space. We demonstrate that IsoScore has several desirable properties that existing scores do not possess, such as mean invariance and direct correspondence to the number of dimensions used. Furthermore, IsoScore is conceptually intuitive and computationally efficient, making it well suited for analyzing the distribution of point clouds in arbitrary vector spaces, not only those of word embeddings. Finally, we use IsoScore to demonstrate that a number of recent conclusions in the NLP literature that were derived using brittle metrics of spatial distribution, such as average cosine similarity, may be incomplete or altogether inaccurate.
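To make the brittleness of average cosine similarity concrete, the following minimal sketch (an illustration only, not the IsoScore computation) compares an isotropic Gaussian point cloud with a translated copy of itself. The helper name `average_cosine_similarity`, the synthetic data, and the shift of 5.0 are assumptions chosen for the example. Translation leaves the covariance of the cloud unchanged, yet the average cosine similarity moves from roughly zero to nearly one, which is the lack of mean invariance referred to above.

```python
import numpy as np


def average_cosine_similarity(points):
    """Mean cosine similarity over all ordered pairs of distinct rows."""
    norms = np.linalg.norm(points, axis=1, keepdims=True)
    unit = points / norms                      # project every point onto the unit sphere
    sims = unit @ unit.T                       # all pairwise cosine similarities
    n = len(points)
    # Drop the diagonal (each point compared with itself, always 1).
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))


rng = np.random.default_rng(0)
cloud = rng.standard_normal((500, 8))          # isotropic cloud centered near the origin
shifted = cloud + 5.0                          # same cloud, translated along every axis

centered_score = average_cosine_similarity(cloud)
shifted_score = average_cosine_similarity(shifted)

# The spread of the cloud is identical before and after the shift.
same_covariance = np.allclose(np.cov(cloud, rowvar=False),
                              np.cov(shifted, rowvar=False))

print(f"centered: {centered_score:.3f}")
print(f"shifted:  {shifted_score:.3f}")
print(f"covariance unchanged: {same_covariance}")
```

Under these assumptions, a reader relying on average cosine similarity alone would judge the shifted cloud to be highly anisotropic, even though it occupies all eight dimensions exactly as uniformly as the centered one.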