Learning vector representations of words is one of the most fundamental topics in NLP, capable of capturing syntactic and semantic relationships useful in a variety of downstream NLP tasks. Vector representations can be limiting, however, in that typical scoring functions such as dot product similarity intertwine a vector's position and magnitude in space. Exciting innovations in representation learning have proposed alternative fundamental representations, such as distributions, hyperbolic vectors, or regions. Our model, Word2Box, takes a region-based approach to word representation, representing words as $n$-dimensional rectangles. These representations encode position and breadth independently and provide additional geometric operations, such as intersection and containment, which allow them to model co-occurrence patterns that vectors struggle with. We demonstrate improved performance on various word similarity tasks, particularly on less common words, and perform a qualitative analysis exploring the additional unique expressivity provided by Word2Box.
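To make the geometry concrete, below is a minimal sketch of the box representation: an axis-aligned $n$-dimensional rectangle stored as a pair of corner vectors, supporting volume, intersection, and an asymmetric containment score that a symmetric dot product cannot express. The `Box` class, `containment` function, and toy coordinates are illustrative assumptions for exposition only, not the paper's actual (trained, smoothed) implementation.

```python
import numpy as np

class Box:
    """A word box: an axis-aligned n-dimensional rectangle (illustrative sketch)."""
    def __init__(self, lo, hi):
        self.lo = np.asarray(lo, dtype=float)  # lower corner, one value per dimension
        self.hi = np.asarray(hi, dtype=float)  # upper corner

    def volume(self):
        # Product of side lengths; clipped to zero if the box is degenerate/empty.
        return float(np.prod(np.clip(self.hi - self.lo, 0.0, None)))

    def intersect(self, other):
        # The intersection of two axis-aligned boxes is again an axis-aligned box.
        return Box(np.maximum(self.lo, other.lo), np.minimum(self.hi, other.hi))

def containment(a, b):
    # Fraction of b's volume covered by a -- an asymmetric score,
    # unlike the symmetric dot product between two vectors.
    vb = b.volume()
    return a.intersect(b).volume() / vb if vb > 0 else 0.0

# Toy 2-d example: a broad box for a general word and a narrow box inside it.
animal = Box([0.0, 0.0], [4.0, 4.0])
dog = Box([1.0, 1.0], [2.0, 2.0])
print(containment(animal, dog))  # 1.0: the narrow box lies entirely within the broad one
print(containment(dog, animal))  # 0.0625: the broad box is mostly outside the narrow one
```

Note how position (where the box sits) and breadth (its side lengths) are separate degrees of freedom, so a frequent, broad-usage word can simply occupy a larger box without shifting its location.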